
5] Scope and Method of Statistics. 229

SCOPE AND METHOD OF STATISTICS. 

By Harald Westergaard.

I. 

It might appear absurd to discuss the scope and method of 
a science with at least 250 years of history, an evolution from 
the trifling beginning as "political arithmetic" to its present 
position as an important, or rather an indispensable, instru- 
ment in the complicated machinery of modern government 
and an auxiliary science in many branches of human thought. 
It might be supposed that no unsettled questions about scope 
and method were left, that on the contrary the tradition of 
many years would enable the great army of statisticians to 
follow a definite plan, without any doubt whatever about 
their final goal. 

Still it is a fact, that there are at present not one, but 
several corps of statisticians, each trying earnestly to promote 
the science, but hardly able to cooperate for lack of mutual 
sympathy and sometimes acting in direct opposition to one 
another. 

I hope, therefore, that it will not appear entirely useless 
to try to throw some little light on this problem. In doing so 
I shall first give a brief outline of some essential points in the 
evolution of statistics. 

In the seventeenth and eighteenth centuries we find three 
entirely different lines of evolution, viz., the calculus of proba- 
bilities, "political arithmetic," and the comparative de- 
scription of states, the last discipline being the only one to 
which the name of statistics was appropriated, though this 
one had originally very little to do with statistics in the 
modern sense of the word. 

Statistics as the comparative description of states was culti- 
vated by Aristotle in his famous Constitutions of 158 states, 
of which only the Constitution of Athens has survived. To 
the same category belong the works of the Italian authors 
Sansovino and Botero, dating from the end of the sixteenth 
century, those of their contemporary in France, Etienne 
Pasquier, who wrote Recherches de la France, and those of 
Pierre d'Avity, who published his great work, Les Estats 
du Monde, a little later (1614). 



230 American Statistical Association. [6 

It was, however, in Germany chiefly that this study was 
cultivated. The well-known polyhistor, Hermann Conring, 
lectured on it for many years in the seventeenth century, 
beginning his description with Spain and ending it with Japan, 
Morocco, and Abyssinia; and in the eighteenth century we 
meet in Germany a whole series of writers from Achenwall to
Schlözer, dealing with this theme. Most of these works had
a family resemblance, containing descriptions of the several
countries, their climate and population, their produce and con- 
stitution, etc., but with hardly any numerical data; for these 
German authors did not take much interest even in the scanty 
numerical material which was at hand. Sometimes these 
descriptions were printed in a schematic form, giving under 
each heading for each country a short verbal description; in 
this way the so-called "Tabellen-Statistik" arose. 

These tabular descriptions might be looked upon as signs of 
degeneration; but, clumsy and naive as they often were, they 
actually indicate progress. For gradually the authors of 
Tabellen-Statistik began to fill their paragraphs with nu- 
merical facts, as material of this kind accumulated. In par- 
ticular comparative tables with facts relating to population 
were introduced. One of the most important contributions 
of this kind was written by Crome (1785), with tables con- 
taining the number of inhabitants, area, density of population, 
and the "possible number," that is to say, the population 
assuming an average density of 3,000 inhabitants per (German) 
square mile. Crome tried to criticize in a really scientific 
spirit all the figures at hand and to make the best possible 
estimates. Thus a bridge was built between this discipline 
and statistics in the modern sense of the word, these compara- 
tive descriptions of various countries being increasingly in- 
fluenced by the growing abundance of numerical material. 
This process was not without its dramatic features. At the 
beginning of the nineteenth century an interesting discussion 
over the meaning of statistics took place in some learned 
German publications; the views of the antagonists seemed 
quite irreconcilable; each side maintained that its conception 
of statistics was the true one. It was maintained that the 
adherents of the Tabellen-Statistik simply "skeletonized" 
the noble science of statistics, that they were Tabellenknechte, 



7] Scope and Method of Statistics. 231 

unimaginative "slaves of the tables." As in so many other 
scientific controversies, there was never a formal conclusion; 
by and by the discussion died away, and gradually the name 
statistics was transferred from this discipline to political 
arithmetic, as formally proposed in a masterly essay, written 
in the middle of the century by the famous German economist, 
K. Knies. 

The foundation of the calculus of probabilities, the second 
line of evolution, was laid by Italian and French mathema- 
ticians, who were interested in the problems of games at dice, 
lotteries, etc. This discipline reached a very high develop- 
ment in the eighteenth century, culminating at the beginning 
of the nineteenth century in Laplace's masterly Théorie
analytique des probabilités (1812).

The chief problem of the calculus of probabilities was 
discussed in J. Bernoulli's posthumous work, Ars conjectandi 
(1713) and further developed by several brilliant mathema- 
ticians. According to Bernoulli's theorem, the probability 
of a certain deviation from a given type will depend on the 
number of observations, so that by increasing the number of 
observations it is possible to get as narrow limits of deviation 
as we like. This is the famous "law of large numbers," which
has had so prominent a part in many statistical discussions. 
In fact, if it is possible to calculate the limits of deviation from 
the standard, we can easily control statistical conclusions. 
If the average rate of infant mortality is, say, 0.20, and if in
a group of 10,000 infants a rate of 0.21 is very rarely reached,
then in a group of 1,000,000 infants a rate of 0.201 will be equally
seldom found, and with 100 million observations the limit is
reduced to 0.2001; for the limits of deviation narrow in proportion
to the square root of the number of observations.
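
The arithmetic of this example may be checked by a short sketch. It assumes, beyond what the text states, that the deaths are independent events, so that the standard error of an observed rate is sqrt(p(1-p)/n); the function names are illustrative only.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of an observed proportion under a binomial model (an assumption)."""
    return math.sqrt(p * (1.0 - p) / n)


def deviation_in_standard_errors(p: float, n: int, limit: float) -> float:
    """Number of standard errors by which the stated limit exceeds the true rate p."""
    return (limit - p) / standard_error(p, n)


TRUE_RATE = 0.20  # the average rate of infant mortality supposed in the text

# (number of infants observed, rate that is "very rarely reached")
CASES = [(10_000, 0.21), (1_000_000, 0.201), (100_000_000, 0.2001)]

Z_VALUES = [deviation_in_standard_errors(TRUE_RATE, n, limit) for n, limit in CASES]

for (n, limit), z in zip(CASES, Z_VALUES):
    print(f"n = {n:>11,d}   limit = {limit:<7}   standard errors = {z:.2f}")
```

In each of the three cases the stated limit lies 2.5 standard errors above 0.20, which is why the three rates are "equally seldom found": a hundredfold increase in the number of observations narrows the limit tenfold.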

Although the mathematicians clearly understood the 
significance of Bernoulli's theorem, they failed to exhaust it 
fully. Most of the mathematicians who developed the calcu- 
lus of probabilities seemed to suppose, as an axiom which 
needed no test, that its theorems would hold good everywhere. 
Those theorems can in fact be applied to a goodly number of
cases, not least in games at dice, lotteries, etc.; but the
important question was left untouched, whether and under 
what conditions these theorems might be applied to the more 
complicated problems in vital or economic statistics. No in-
vestigation was made to test the applicability of the theorems 
in this field; hardly even was an attempt made with regard to
the original field of the calculus of probabilities, in spite of the 
severe but rather unjust criticism by d'Alembert, who directly 
appealed to mathematicians to make experiments with regard 
to games in order to find the real laws of chance. Many 
questions of this kind still remain unsolved. 

The third line of evolution began in England. The cradle 
of political arithmetic is to be found in London, where a mer- 
chant, John Graunt, in 1662 wrote his remarkable Natural 
and Political Observations upon the Bills of Mortality. This 
treatise is a keen and ingenious attempt to calculate a life 
table from the London bills of mortality, on the basis of the 
returns of cause of death only, no statement of age at death 
being made. Graunt's results probably were wide of the 
truth; thus he supposed that three eighths of the persons 
surviving at six years would die before reaching sixteen, and 
again, that the same proportion would die between sixteen 
and twenty-six. Probably the actual death rate was only a 
fraction of this estimate. But still we must admire Graunt's 
acumen and his grasp of these highly important problems, 
which in the following generations were to be studied by a 
number of keen investigators, including the brilliant Halley 
and Lavoisier. 

Halley's Breslau table is the next step in advance, a real 
seven-league stride, hardly appreciated by his contempo- 
raries. His paper, published in 1693, contained in nuce a 
whole theory of vital statistics and life insurance. And not 
only did he grasp the problems thoroughly, but there is also 
reason to believe that his numerical results were fairly correct, 
in spite of the imperfections of his data. Thus, the total 
number of inhabitants which he calculated seems to have 
been fairly close to the truth, and so, too, his life table; it 
gave a good picture of the chances of death in a non-epidemic 
period in those times, though mortality in years of plague 
might, indeed, rise to a much higher level. 

In the following century several mortality tables were con- 
structed, often based on much more complete observations 
than were at Halley's disposal; thus by Struyck and Kersse-
boom in Holland and Deparcieux in France, the latter for 
instance calculating most interesting life tables for monks and 
nuns. Not less important was the progress in Sweden (1748), 
where a whole system of vital statistics was founded, with a 
population register in each parish and complete lists of births 
and deaths. The first reliable mortality tables for a whole 
country were calculated by Wargentin from these data. 

How rich the literature of political arithmetic had grown in 
the course of a century appears from Süssmilch's compilation
in his great work, Die göttliche Ordnung (2nd ed., 1761).
His work contained, moreover, several original contributions, 
including a life table which enjoyed a great reputation in 
spite of conspicuous defects, obvious even to his contempo- 
raries. 

Halley's contemporaries, King and Davenant, had chiefly 
economic statistics in view. We owe to Davenant a first 
attempt at a theory of political arithmetic. (Of the Use of 
Political Arithmetic, in all considerations about the revenues 
and trade, 1698.) A single quotation will suffice: "As for 
example, when the number of inhabitants in England is known, 
by considering the extent of the French territory, their way 
of living, and their soil, and by comparing both places, and 
by other circumstances, a near guess may be made how many 
people France may probably contain." Statistical conclu- 
sions were very often drawn in this way in those days. Un- 
fortunately the keystone was wanting; Davenant failed to 
explain how to avoid erroneous conclusions or how to find 
the limits of probable error. 

In France several renowned authors of the eighteenth 
century grappled with the problem of calculating the number 
of inhabitants. Thus Messance, in his Recherches sur
la population (1766), found a population of 24,000,000; Moheau
arrived at a similar result (1778). Their method is easily
understood. Moheau finds, for instance, in representative
districts, that each year on an average there would be one
birth to every 25½ inhabitants; the average number of
births in France being about 929,000, he estimates the popu-
lation at 23,700,000. The number of deaths is in Moheau's
opinion less reliable as a basis for estimate, though in the
present case, by using the deaths, he was able to reach nearly 
the same result. 
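
Moheau's procedure, as described above, amounts to multiplying the yearly births of the whole country by the number of inhabitants per birth found in the sample districts. The following sketch reads that ratio as 25½, the value consistent with the stated total of 23,700,000; the district returns are invented for illustration and are not Moheau's figures.

```python
def inhabitants_per_birth(districts):
    """Ratio of inhabitants to yearly births, pooled over the sample districts."""
    total_inhabitants = sum(inhabitants for inhabitants, _ in districts)
    total_births = sum(births for _, births in districts)
    return total_inhabitants / total_births


def estimate_population(national_births, ratio):
    """Scale the national yearly births by the inhabitants-per-birth ratio."""
    return national_births * ratio


# Hypothetical district returns: (inhabitants, yearly births).
# These numbers are assumptions chosen to reproduce a ratio of 25.5.
SAMPLE_DISTRICTS = [(51_000, 2_000), (25_500, 1_000), (76_500, 3_000)]

RATIO = inhabitants_per_birth(SAMPLE_DISTRICTS)

# 929,000 is the average yearly number of births in France given in the text.
POPULATION = estimate_population(929_000, RATIO)

print(f"inhabitants per birth: {RATIO}")
print(f"estimated population:  {POPULATION:,.0f}")
```

The product is 23,689,500, which rounds to the 23,700,000 of the text. The sketch gives no measure of the error of the estimate, and that is precisely the defect noted in these early calculations.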

Lavoisier made a step forward in political arithmetic (1791);
taking Moheau's method of estimate from representative 
districts, he applied it to agricultural statistics, and calculated 
on the basis of representative observations the number of 
ploughs and of horses and cattle, as well as the harvest and 
the cultivated area of the country. 

Ingenious as these calculations often were, they were never-
theless somewhat unreliable and arbitrary, for it seemed
hardly possible to check the results. Laplace pointed in the
right direction, basing his calculations of the number of in- 
habitants on figures of population and births in selected 
districts in 30 departments. His estimate of the number of 
inhabitants was probably nearer the truth than the previous 
estimates. But even he did not reach the bottom of the 
problem; he did not prove that these observations were grouped 
around a typical value according to the calculus of probabil- 
ities. After Laplace not much was done to develop the 
theory of representative statistics or of indirect methods of 
finding the numbers required. This is probably connected 
with the fact that direct methods of complete enumeration, 
such as a census covering an entire country, gradually became 
popular in spite of the great distrust with which at first they 
were regarded. In the earlier period of statistics this dis- 
trust was very characteristic. Davenant preferred the books 
of the Hearth Office; Messance would even rely on statistics 
of the yearly consumption of goods as a basis for estimating 
the number of inhabitants. Even as late as the middle of 
the nineteenth century, there was in France and Belgium a 
lively discussion of the advisability of using census results 
in calculating mortality tables, several well-known statisti- 
cians trying to do without the census. 

The census held its ground, however, and no modern stat- 
istician would doubt the general reliability of enumerations 
of this kind. Only in calculating the infant mortality most 
statisticians still prefer the births to the census results, but 
even here a tendency toward using census figures is percepti- 
ble. 




At the beginning of the nineteenth century the outlook for 
statistics was rather favorable. Statistical offices were es-
tablished in France and Prussia; a census was taken in the
United States in 1790 and again ten years later. But in 
Europe the long war and the political reaction after the 
conclusion of peace delayed progress, so that the new era 
could not begin until about 1830. From that time on there 
has been continuous and rapid progress. The enthusiasm of 
those early years was astonishing: statistical societies were 
founded; journals like Annales d'hygiène publique or the
Journal of the Statistical Society of London published numerous 
statistical articles; statistical offices were founded or reorgan- 
ized. This enthusiasm has waned a little, but nothing has 
prevented a steady progress, an immense accumulation of 
numerical data, of reports, enquêtes, etc. Under this
rapid evolution statistics could not be altogether free from 
dilettanteism; it is even surprising that the number of such 
untrained workers has been so small. 

Quetelet (1796-1874) may be considered as the typical 
statistician of his period. His enthusiasm, his readiness to 
draw conclusions from a rather meagre collection of material, 
his faith in statistics as the queen among all sciences, his 
theory of the great regularity in all statistical phenomena — all 
these features were common to many statisticians of his time, 
though they lacked the impress of his brilliant genius and his 
wonderful style. It is easy now to see his defects: Quetelet 
was too willing to believe in the constancy of types; he was 
fully convinced, for instance, that mortality did not change, 
though, even while he was writing, a considerable decrease 
of mortality was beginning. He saw too much in his "homme 
moyen," that remarkable being, who should unite all the 
qualities of the nation, as a representative of beauty and 
harmony; he was too optimistic with regard to international 
statistics and the usefulness of congresses. All this prevented 
him from seeing the variety which prevails everywhere, the 
constant changes in all conditions of human life. These 
mistakes were not peculiar to him. Mathematicians sought 
to find a formula by which mortality might be expressed as a 
function of age, a physiological law of mortality, so to speak.
The regularity of all statistical data being an axiom, it did 
not occur to the statisticians of the period that a thorough 
investigation into the conditions and the limits of this regu- 
larity was necessary. 

But these erroneous ideas have to a great extent disappeared. 
In the following generation we perceive a certain reaction, a 
suspicion of laws and types. The statisticians mostly confined 
themselves to collecting exact details, giving, so to speak, a 
photograph of the conditions, but taking less interest in the 
causes behind all the phenomena. The doctrines of the pres-
ent Italian school of anthropologists (Lombroso and his
disciples) are perhaps the most closely akin to those held by 
Quetelet, but their views no doubt will gradually be changed 
under the influence of the strong and sober reaction which 
prevails among the statisticians of the present day. 

As remarked above, the calculus of probabilities is one of 
the main sources of the theory of statistics. This discipline 
in the past century has had a healthy growth, some of the 
finest mathematicians having contributed to its development. 
It is sufficient to name Gauss (method of least squares) and 
Poisson (Recherches sur la probabilité des jugements, 1837).
For many years statistics was too little influenced by these 
contributions. The official statisticians as a rule did not 
understand or appreciate the law of error. They were content 
to deal with statistical data in quite an elementary way with 
the result that much labor was wasted. It was often thought 
necessary to accumulate immense masses of data in order to 
prove results which might have been obtained from much 
less material, if the simple formulas of the calculus of prob- 
abilities had been used. 

But here, too, came a reaction. One of the leaders was 
Woolhouse, who published his remarkable article, On the 
Philosophy of Statistics in 1872. Later came the brilliant 
contributions of F. Y. Edgeworth and of Karl Pearson's school, 
which took up problems of great significance concerning the 
physical and moral development of men, particularly with 
regard to heredity. Important contributions to the theory of 
statistics were made by the well-known German economist, 
Lexis, and his disciple, Bortkiewicz. Lexis took up the prob-
lem of regularity in statistical data, dealing principally with 
the ratio of males to females among new-born children. 

A prominent feature of modern statistics is the immense 
development of the field. The progress of labor statistics 
during the last few decades is an instance of such extension of 
official statistical work. A further broadening of the field 
is due to the private efforts of pioneers, supported by the 
generous gifts of men like Francis Galton, and to the activi- 
ties of life insurance and accident insurance companies and 
institutions investigating the mortality of the insured, or the 
frequency of invalidity, illness, or accident. 

In this extension of the field of statistics much has depended 
on the attitude of the people towards statistical inquiries. 
The spread of education through good public schools has 
been an important factor in allaying popular suspicion. 
The former distrust lest statistical inquiries might prove 
to be a preliminary to increased taxes has changed into a 
general confidence and a willingness to cooperate with the 
authorities. Consequently in the future there will be rather 
an embarras de richesses; the difficulty will be not so much in 
gathering material as in mastering it, in digesting all these 
masses of reports which have been stored in the archives and 
on the book-shelves of statistical offices. 

II. 

What is the present outlook for statistics? Some hints of 
future possibilities are suggested by the foregoing outline 
of its development. 

Let us consider first the future extension of the field of 
statistics. I have already mentioned the immense progress 
which has been made of late. It might perhaps seem difficult 
to make further advances, but the movement has been so rapid, 
that we can hardly think it will at once come to a standstill, 
and at all events some opportunities of further progress are 
evident. We need only to follow the lead of the old political 
arithmeticians. An essential feature in those early days was 
the frequent use of representative statistics, where observa- 
tions covering the whole ground were not available. This 
method was frequently used too naively, but its principle 
was not bad. Laplace pointed in the right direction, claiming 
that the representative districts should be very numerous and 
spread over the whole area. What remains to be done, is 
simply to develop a theory of these representative enumera- 
tions, stating the limits of the deviations from the true 
proportions and showing how to approach as nearly as pos- 
sible to the truth. In fact in many cases it will be practi- 
cally impossible to do without representative statistics. A 
typical instance is the production of milk in a country. No 
statistician would think of measuring the whole quantity of 
milk, yielded by all the cows in the country. It is only 
necessary to select a certain number of cows, fairly repre- 
sentative of the whole, and to test the quantity of milk from 
these cows on certain days. From these observations the 
whole product may be approximately estimated. So also 
with regard to the production of butter, meat, etc. The statis- 
tics of crops, too, will always depend, somehow or other, on 
representative data. Again, we may ask what is the amount 
of timber, which can be produced in a forest: we can only 
solve the problem by choosing certain select areas in the forest 
and proceeding to measure the timber in each. The rates to
be charged by insurance offices are another example. By
taking samples of workshops, cotton mills, theaters or farm 
buildings, we can find their characteristic fire risk, and from 
these observations we can derive the rates for similar buildings. 
Interesting examples of representative statistics are furnished 
by the Norwegian investigations into the distribution of in- 
comes. By selecting certain places, certain streets and houses, 
and inquiring about the income and other circumstances of 
each person there, more detailed information could be secured 
than from returns covering a whole community, though in 
fact the modern systems of income taxes have made such 
knowledge easier to get than it used to be. 
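
The milk example above may be put in the form of a short sketch. The text calls for a theory "stating the limits of the deviations" without supplying one; the sketch substitutes the formula of simple random sampling with a finite-population correction, which is an assumption of this illustration and not the author's method. The herd size and the daily yields are invented.

```python
import math
import statistics


def estimate_total(sample, population_size):
    """Estimate a population total and its standard error from a simple random sample.

    Assumes the sampled units were drawn at random without replacement.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    fpc = math.sqrt(1.0 - n / population_size)  # finite-population correction
    total = population_size * mean
    standard_error = population_size * s / math.sqrt(n) * fpc
    return total, standard_error


# Hypothetical daily yields, in litres, of five sampled cows.
SAMPLE_YIELDS = [10.0, 12.0, 8.0, 11.0, 9.0]
HERD_SIZE = 1_000  # hypothetical number of cows in the country

TOTAL, STANDARD_ERROR = estimate_total(SAMPLE_YIELDS, HERD_SIZE)

print(f"estimated daily total: {TOTAL:,.0f} litres")
print(f"standard error:        {STANDARD_ERROR:,.0f} litres")
```

With so small a sample the standard error is large in relation to the total; it falls roughly as the square root of the number of cows tested, which bears out Laplace's claim that the representative units should be very numerous.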

Related to the method of representative statistics are
certain methods of interpolation. Suppose that the dis-
tribution of a certain population according to years of age
is known, but that in dividing the population into groups
according to occupation it was found too expensive, in spite
of the enormous technical progress of recent decades
(electric counting machines, etc.), to give the same details,
so that the occupation classes are divided only into broad
age-groups. Then we can fairly well calculate the detailed
age distribution, on the basis of the age distribution in the
whole population. The formula of interpolation applied may
be a rather crude one, as was the case with Süssmilch's esti-
mate of infant mortality in one class of population on the basis
of that of another, or we may invent more refined methods; 
at all events the principle will be the same. As an instance 
we may take the comparative mortality among widowers and
married men. If the numbers living are classified in ten-
year age groups, 55-64, 65-74, etc., it is clear that in each
group the mean age of widowers will be higher than the mean 
age of married men. For this reason there will be pro- 
portionally more deaths of widowers, even if their mortality 
is just the same, and we will be tempted to draw erroneous 
conclusions from the observations. By means of inter- 
polation we can reduce the age periods and thus remove the 
described source of error. 
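
The source of error just described can be exhibited numerically. In the following sketch all figures are hypothetical: both classes are given exactly the same death rate at each single year of age, and only their distribution within the group 55-64 differs, the widowers being on the average older.

```python
AGES = range(55, 65)  # one ten-year age group, 55-64

# The same hypothetical death rate for both classes at each year of age.
DEATH_RATE = {age: 0.020 + 0.002 * (age - 55) for age in AGES}

# Hypothetical numbers living: married men thin out with age, widowers increase.
MARRIED = {age: 1000 - 50 * (age - 55) for age in AGES}
WIDOWERS = {age: 100 + 20 * (age - 55) for age in AGES}


def mean_age(living, ages=AGES):
    """Mean age of the persons living within the given ages."""
    return sum(age * living[age] for age in ages) / sum(living[age] for age in ages)


def crude_death_rate(living, ages=AGES):
    """Deaths divided by numbers living, taken over the given ages as one group."""
    deaths = sum(living[age] * DEATH_RATE[age] for age in ages)
    return deaths / sum(living[age] for age in ages)


print(f"mean age   married {mean_age(MARRIED):.2f}   widowers {mean_age(WIDOWERS):.2f}")
print(f"death rate married {crude_death_rate(MARRIED):.4f}   "
      f"widowers {crude_death_rate(WIDOWERS):.4f}")

# Reduced to single years of age, the apparent difference disappears.
for age in AGES:
    assert abs(crude_death_rate(MARRIED, [age]) - crude_death_rate(WIDOWERS, [age])) < 1e-12
```

Taken as one ten-year group the widowers show the higher death rate, though by construction their mortality is the same at every age; reducing the age periods removes the spurious difference.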

III. 

Another group of most important problems relates to the 
correctness of the numbers observed. 

Statistical observations are undoubtedly much more correct 
nowadays than formerly. But of course these observations 
are not and cannot be entirely accurate. In enumerating a 
population at a given moment we can get fairly close to the 
truth, but it will be hardly possible to register all the inhab- 
itants without any omissions (for instance tramps) or alto- 
gether to avoid double counting. There will always be some 
little possibility of error left, and the question is, whether 
the limits of such errors and the effect they may have on our 
conclusions can be determined. Here we can learn from the 
old political arithmetic, for these authors, naive and un- 
practical as they often were, frequently had a clear under- 
standing of the importance of finding the limits of error in 
the statistical material. Süssmilch gives a good instance
with regard to Berlin, Heysham with regard to Carlisle, and 
Milne's correspondence with Heysham concerning the 
Carlisle observations testifies to his anxious endeavor to
correct his materials as much as possible before calculating 
his mortality table. This ought to be a characteristic of 
every statistical report; the degree of accuracy of the obser- 
vations should always be carefully considered. It would 
be a useful task to gather observations on the limits of error 
out of the enormous literature of statistical reports, and to 
try to systematize the results. Such a work would probably 
contribute not a little to our confidence in numerical data 
and would help us to draw valid conclusions in spite of the 
defects of the materials. There are in fact already available 
not a few discussions of this kind and undoubtedly more will 
appear as soon as the question has attracted the attention of 
official statisticians. 

In several cases we have ways of testing statistical data. 
Sometimes certain questions are answered from two different 
points of view: for instance, in labor statistics, when reports 
are secured both from employers and from trade unions. In 
statistics of international trade the import of goods from 
one country into another should correspond to the export of 
the same goods from the former country into the latter. 

Frequently it will not be possible without extra effort to 
determine the margin of error. For example, in age returns 
at some periods of life there is a temptation to overstate age: 
an infant 11 months old may perhaps be registered as having 
completed the first year of life; in extreme old age many per- 
sons, proud of their age, are tempted to exaggerate it a little. 
Again the gentler sex is often suspected of understating age. 
Further, there is the concentration on multiples of five or ten, 
common in census reports even where not only the age in 
years but also the date of birth is entered in the schedules; 
for many persons, ignorant of their exact age, are inclined to 
state it in years as a multiple of five or ten and then compute 
the year of birth by subtraction from the census year. A 
revision of the census figures is therefore necessary. In some 
cases a complete revision is possible. Thus in Norway, the 
country of centenarians, a careful revision of the data con- 
cerning persons in extreme old age was undertaken by com- 
parison with the parish registers of births. In other cases 
we must confine ourselves to making a thorough investigation
of representative or select parts of the material, for instance 
certain parishes; from these parts it is often possible to draw 
valid conclusions applicable to the whole. 

A certain pessimism is often encountered among official 
statisticians or economists who attempt to draw conclusions 
from statistics. It is maintained that the inaccuracies are 
often so great that it is impossible to get any reliable results. 
Here, then, is another important problem in the theory of 
statistics, viz., to determine the significance of the inaccura- 
cies, to state to what extent it is possible to draw conclusions 
from statistical data, in spite of their imperfections. Here 
it would be most useful to form a theory of the applicability 
of imperfect data. For it can easily be shown that even 
extremely incorrect data may under certain circumstances 
allow of perfectly safe conclusions. It will be sufficient to 
discuss a few examples. 

Let us suppose that we want to find the mortality of hospi- 
tal patients suffering from pneumonia, grouped according to 
our estimate of their temperance or intemperance in the use 
of alcoholic beverages. Here the division line is very vague. 
Two persons with just the same habits may be registered, 
one as temperate, one as intemperate. The pessimists would 
maintain that observations of this kind are wholly without 
statistical value. But there is no reason to think that the 
observers will knowingly falsify the facts. There are patients 
whom every observer will judge intemperate and others who are 
as clearly temperate; between these two groups there are some 
whom it may be difficult to assign definitely to either group, 
and here different observers might make different decisions. 
Now let us further suppose that persons with intemperate 
habits suffering from pneumonia have a death rate of 40 per 
cent., whereas patients whose consumption of spirits is under 
a certain limit have a mortality of only 30 per cent., then as a 
consequence of the error in classification, the results may not 
show the true difference in relative mortality of the two 
groups, but it can easily be shown that the results will always 
indicate in which group the mortality is higher. If out of 
1,000 patients recorded as drunkards only 900 are really such, 
whereas 100 ought to have been registered as temperate, and
a similar group of 1,000 "temperate" patients contains 100 
who are really intemperate, then the observed number of 
deaths in the former group would be, not 400, but only 
360+30 = 390, whereas in the latter group mortality would be 
270+40 = 310 instead of 300. The difference, still marked, 
is merely somewhat reduced. If mistakes were still more 
frequent, so that, for instance, 400 of one group ought to have 
been registered in the other one and vice versa, then mortality 
would have been 240+120 = 360 and 180+160=340; there 
would still be a difference left. Only if more were reported 
incorrectly than correctly, would we get an erroneous result; 
where this is not the case, we find a difference testifying to an 
unfavorable effect of intemperance, even if we cannot measure 
this effect; we can only say that the difference is at all events 
not smaller than that which we have observed. But in 
statistics it often matters little whether we can measure the 
true effect of the causes which are acting or not, if we can only 
prove that such causes are in existence. 
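
The arithmetic of this example can be retraced with a short sketch. It assumes, as the text does, true death rates of 40 and 30 per cent. and two recorded groups of 1,000 patients each, with the same number wrongly classified in both; the function name and its parameters are illustrative only.

```python
def observed_rates(rate_a, rate_b, group_size, misclassified):
    """Observed death rates in two recorded groups of equal size when
    `misclassified` members of each group really belong to the other."""
    correct = group_size - misclassified
    deaths_recorded_a = correct * rate_a + misclassified * rate_b
    deaths_recorded_b = correct * rate_b + misclassified * rate_a
    return deaths_recorded_a / group_size, deaths_recorded_b / group_size


# True rates: 40 per cent. (intemperate) and 30 per cent. (temperate).
for wrong in (0, 100, 400, 500, 600):
    recorded_intemperate, recorded_temperate = observed_rates(0.40, 0.30, 1000, wrong)
    print(wrong, round(recorded_intemperate, 3), round(recorded_temperate, 3))
```

As long as fewer than half of each group are wrongly classified, the recorded "intemperate" group keeps the higher rate, though the gap shrinks; beyond one half the order is reversed.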

Or how does the health of children in parish-schools vary 
according to the economic circumstances of the parents? It 
may be that one observer will consider as belonging to a poor 
home a child whom another would report as well-to-do. Thus 
we have some chance of error in the observations, a fact, 
however, which does not prevent us from finding a difference 
in health among the two groups, if poverty has a deteriorating 
effect on health. 

Or we may take a typical text-book example. It is a well- 
known fact that the death rates of legitimate and illegitimate 
infants at different ages apparently follow different laws. 
Up to a certain age, legitimate children have a lower rate of 
mortality than illegitimate, but after that age, a higher rate. 
In former days statisticians were often inclined to ascribe 
this phenomenon to a selection among illegitimate children, 
the more delicate being overtaken by an early death, whereas 
the healthier children survived. Thus the stock gradually 
grew healthier, whereas among children born in wedlock the 
opposite would be the case, delicate children being kept 
artificially alive, till at last they succumbed in spite of all the 
care of their parents. But there is an evident source of error,
viz., the legitimation of children by the subsequent marriage
of their parents. If such a child dies after the marriage, it
will generally be registered as legitimate, though at birth 
it was reported as illegitimate. Consequently the fraction 
measuring the mortality of the illegitimate children is too 
small, of the other group, too large. As long as the mortality 
among legitimate children is smaller than among illegitimate, 
we know that there is an unfavorable cause acting on the 
latter and that the effect is even greater than the observed 
difference; but when the difference is negative, we are unable 
without further data to say whether the difference should be 
ascribed to defects in the data or to a selective process. We 
owe to the German statistician, Boeckh, a series of excellent
investigations on this subject, proving that legitimate
children in Berlin at no age of life had a higher mortality than
illegitimate children.

This example is another illustration of the fact that we are 
often unable to measure the exact numerical effect of a cause, 
though we can prove that it really exerts some influence. But 
often it is unnecessary to know more than that. If we have 
found that a certain cause is detrimental to health, we may 
leave it to social legislation or to hygiene to bring about a 
change if possible, to secure which it is not so important to 
know the exact statistical quantities. 

IV. 

A third problem of wide import relates to the "law of 
error." We have seen that the old political arithmeticians 
did not realize the necessity of investigating the limits 
within which the observed values deviate from the stand- 
ard, and, notwithstanding several efforts in this field of 
late years, much work has yet to be done. Outside of 
anthropometry, which presents interesting analogies to the 
law of error, not much had been done until Lexis took up 
the question, as mentioned above; Bortkiewicz made interesting
investigations, particularly with regard to small numbers
of observations (Das Gesetz der kleinen Zahlen, 1898); and I
may refer to my own investigations beginning as far back 
as 1884. 




Thus, in trying to explain the differences in the mortality
of married men and widowers, we might ask whether by
chance there had been an uncommonly large number of 
widowers with ill-health, or living under such circumstances 
that they were more liable to attacks from epidemic diseases, 
or perhaps with a very small income, etc., — circumstances, 
which have no connection with conditions of married or un- 
married life. If this were the case, another investigation 
might give an opposite result and show a greater mortality 
among married men than among widowers. 

In order to find the limits of these chance deviations, we 
may first consult the theory of the calculus of probabilities, 
asking whether Bernoulli's theorem will give us a point of 
departure. Where the theorems of the calculus of prob- 
abilities hold good, the observations will be grouped around 
the standard value according to the binomial law, found by 
expanding $(p+q)^n$, where p stands for the frequency of the
event, q for its complement (p+q = 1) and where n is the total
number of cases. Where n is sufficiently large, we can
approximately measure the deviations from the standard by the
"mean error" ($\sqrt{npq}$). In two cases out of three the
deviation from the standard will be smaller than the mean error and
in 19 cases out of 20 smaller than twice this quantity, and a
deviation over four times the mean error will be exceedingly
rare. Simple as this formula is, we can simplify it still more,
if the value of p or q is not very great. If p = 0.1 and n = 10,000,
we have a mean error, $\sqrt{10{,}000 \times 0.1 \times 0.9} = 30$. If we
disregard q = 0.9 in the formula, the mean error will be only a little
more; namely, $\sqrt{1{,}000} = 31.6$; we have then only to take the
square root of the number of deaths (np) as a standard of the
deviations. Practically this change makes little difference,
and where the value of p is still smaller, the difference will
often be very small. Thus, if p = 0.01, n = 100,000, we
have for the mean error $\sqrt{990}$ or 31.5, but leaving out q = 0.99,
we get $\sqrt{1{,}000} = 31.6$, or a quite insignificant difference.
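
The two versions of the mean error can be compared with a minimal sketch. The function names are illustrative; the second case uses n = 100,000, the value implied by the quoted roots of 990 and 1,000.

```python
import math


def mean_error(n, p):
    """Mean error of the number of events under the binomial law: sqrt(npq)."""
    return math.sqrt(n * p * (1 - p))


def mean_error_small_p(n, p):
    """Simplified form that leaves out q: the square root of np."""
    return math.sqrt(n * p)


for n, p in ((10_000, 0.1), (100_000, 0.01)):
    print(n, p, round(mean_error(n, p), 1), round(mean_error_small_p(n, p), 1))
```

The smaller p is, the closer q lies to 1 and the less it matters whether it is kept in the formula.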

The use of this formula is evident. If the observed number 
of deaths is 1,020 or 1,030 instead of 1,000, we may look upon 
it as an ordinary deviation, which need not be ascribed to 
peculiar causes. But should the deaths increase for instance
to 1,100 or 1,200, we should naturally infer that some peculiar
causes have been acting, and the next step will be to make 
further observations in order to get a clearer understanding of 
the matter. 

The same formula in a little more complicated form can be 
applied to the chief problem in medical statistics; viz., to find 
whether a particular method of treatment of a disease is 
effective. Let the mortality of patients suffering from the 
disease be $p_2$, when treated with a serum, $p_1$, when treated
without it, and let the numbers of observations in each case be
$n_2$ and $n_1$. Then the mean error of the difference between the
frequencies of dying in the two groups will be
$\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$, where $q_1 = 1 - p_1$ and $q_2 = 1 - p_2$,
and we can get an approximation by putting the observed
relative values instead of $p_1$ and $p_2$. If, for instance, 200
patients have been treated in each group, the numbers of
deaths being 40 and 80, the observed difference thus being
0.20, we shall have the mean error of the difference
$\sqrt{0.0008+0.0012} = 0.045$. The observed difference is four to
five times the mean error, and we are consequently justified in 
believing in the favorable effect of the treatment by serum. 
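
The serum example can be checked with a short sketch that puts the observed frequencies in place of the unknown ones, as the text proposes. The function name and argument order are illustrative.

```python
import math


def difference_and_mean_error(deaths_1, n_1, deaths_2, n_2):
    """Observed difference between two death frequencies and its mean error,
    sqrt(p1*q1/n1 + p2*q2/n2), using the observed frequencies for p1 and p2."""
    p_1 = deaths_1 / n_1
    p_2 = deaths_2 / n_2
    error = math.sqrt(p_1 * (1 - p_1) / n_1 + p_2 * (1 - p_2) / n_2)
    return p_1 - p_2, error


# 80 deaths among 200 patients without serum, 40 among 200 with it.
difference, error = difference_and_mean_error(80, 200, 40, 200)
print(round(difference, 2), round(error, 3), round(difference / error, 1))
```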

But the question arises: is this test applicable, or do the 
casual deviations follow some other law? As mentioned 
above, the students of the calculus of probabilities were often 
mistaken, for explicitly or implicitly they considered the 
theorems as axioms, which would hold good at least in all 
games of chance. In fact this can only be tested by experience. 
Although d'Alembert in his controversy with other mathe- 
maticians had the worst of it, he was right in wishing observa- 
tions; he had actually touched a weak spot in the theory. 

It may, however, be considered as sufficiently proved by 
experience that the calculus of probabilities can be applied 
to the problems of games and similar questions. How shall 
we explain this result? If we toss a coin in the air, or throw 
dice on a table, there will always be numerous circumstances 
which we cannot register at all, but which combined contribute 
to the definite result of the games. The die is on the table in 
a certain position and at a certain distance from the player; 
the position of his hand varies; the die is thrown with varying
force; etc. If all the circumstances each time were exactly
alike, then the same thing would always happen; if the die 
had once shown a six, the same number would be shown at 
every throw; if one card had once been drawn, the same card 
would constantly appear, etc. In fact all these innumerable 
individual causes cooperate, so to speak, so that their effect 
partially disappears, and we are consequently enabled, within 
the limits given by the "law of error," to foretell the collective
result of a number of experiments. We cannot tell what 
number the die will show at a single throw; but if we repeat 
the throw, for instance 12,000 times, we know from experience 
that about 2,000 of these cases will give sixes, and we can with 
the help of the "mean error" easily find the limits of the 
possible deviation from this standard. The mean error is 
41, consequently we will very rarely expect a deviation of, 
say 120 or 160; if the die should give sixes 2,200 times instead 
of 2,000, we should certainly look upon it as "false"; it must 
have had some quality favoring this event, possibly a certain 
part of it was heavier than a corresponding part of other dice. If, 
then, we have found sixes 2,200 times, we know that there is a 
peculiar cause which we shall have to find in some way or other. 
The greater the difference, the easier will it be to find the cause. 
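
What "experience" shows for a fair die can be imitated by simulation. The sketch below repeats the series of 12,000 throws a number of times and expresses each deviation from 2,000 in units of the mean error; the number of repetitions and the seed are arbitrary choices, not part of the original argument.

```python
import math
import random


def count_sixes(throws, rng):
    """Number of sixes in `throws` throws of a fair die."""
    return sum(1 for _ in range(throws) if rng.random() < 1 / 6)


def simulate(trials=100, throws=12_000, seed=1):
    """Deviations from the expected number of sixes, in mean errors."""
    rng = random.Random(seed)
    expected = throws / 6
    mean_error = math.sqrt(throws * (1 / 6) * (5 / 6))
    deviations = [
        (count_sixes(throws, rng) - expected) / mean_error
        for _ in range(trials)
    ]
    return mean_error, deviations


mean_error, deviations = simulate()
share_within_one = sum(abs(d) < 1 for d in deviations) / len(deviations)
print(round(mean_error, 1), share_within_one, round(max(abs(d) for d in deviations), 2))

# A die giving 2,200 sixes deviates by this many mean errors:
print(round((2_200 - 2_000) / mean_error, 1))
```

Roughly two series out of three should fall within one mean error, while a result of 2,200 sixes lies nearly five mean errors from the standard.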

Now, in vital or economic statistics we have an analogy to 
these phenomena. Here, too, a certain regularity prevails; 
particular individual causes are, so to speak, eliminated in 
the collective results. No two persons are alike; one will be 
attacked by an epidemic disease, another will escape; one 
will marry, another remain single; etc., but on the whole in a 
certain class a certain number of deaths or marriages will take 
place. The question now is, What are the limits of deviation 
from the average? Can we find a conformity to the binomial 
law of error similar to that found in playing at dice or cards, 
or do other laws of frequency apply to these phenomena? 

In vital or economic statistics most numbers have a much 
wider margin of deviation than is experienced in games. Thus 
the death rate, the birth rate, the marriage rate, or the relative 
frequency of suicide fluctuates within wide limits. But it 
can be proved that, by dividing the observations, sooner or 
later a marked tendency to the binomial law is revealed in
some part of the observations. Thus, the birth rate varies
greatly from year to year; but every year nearly the same ratio 
between boys and girls, and the same proportions of stillbirths, 
and of twins are observed. The causes which bring about an 
increase or decrease in the number of births thus have appar- 
ently the same influence on both sexes. 

A curious illustration of this is found in the statistics of 
divorces in Berlin, 1899-1908. 





              Number of Divorces.                  Percentage of Divorces.
Year.     Total.   Protestant.   Mixed Religious   Protestant.   Mixed.
                                 Belief.
1899       1,608      1,261           232              78          14
1900         936        732           128              78          14
1901         984        769           149              78          15
1902       1,227        972           174              79          14
1903       1,269        981           198              77          16
1904       1,376      1,048           229              76          17
1905       1,421      1,103           219              78          15
1906       1,639      1,271           266              78          16
1907       1,781      1,350           298              76          17
1908       1,868      1,444           291              77          16

Total     14,109     10,931         2,184              77          15



It appears from this table that the absolute number of divor- 
ces in Berlin has fluctuated widely. A change in the divorce law 
accounts for the low number in 1900 compared with that of 
1899, but except for this the numbers have been constantly on 
the increase. The distribution according to religious belief, 
however, is very regular; calculating the mean error of the per- 
centage numbers, we find that the fluctuations are within the 
limits of the binomial law. Hence we may reasonably con- 
clude that the causes affecting the numbers of divorces act 
with the same force on the two groups of marriages; the in- 
dividual causes affecting the percentages in each group have 
collective results of the same nature as in the case of games. 
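
The claim about the percentages can be tested directly on the table. The sketch below takes the Protestant share of all ten years as the standard and measures each year's deviation in units of its own mean error; the figure for 1903 is read as 981, which makes the column add up to the printed total. The variable and function names are illustrative.

```python
import math

# (year, total divorces, divorces of Protestant couples), Berlin 1899-1908
berlin = [
    (1899, 1608, 1261), (1900, 936, 732), (1901, 984, 769),
    (1902, 1227, 972), (1903, 1269, 981), (1904, 1376, 1048),
    (1905, 1421, 1103), (1906, 1639, 1271), (1907, 1781, 1350),
    (1908, 1868, 1444),
]

total = sum(n for _, n, _ in berlin)
protestant = sum(k for _, _, k in berlin)
standard = protestant / total  # share over the whole period


def deviations_in_mean_errors(rows, p):
    """Each year's deviation from p, divided by the mean error sqrt(pq/n)."""
    result = []
    for year, n, k in rows:
        mean_error = math.sqrt(p * (1 - p) / n)
        result.append((year, (k / n - p) / mean_error))
    return result


for year, z in deviations_in_mean_errors(berlin, standard):
    print(year, round(z, 2))
```

Deviations that stay within about twice the mean error are what the binomial law leads us to expect.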

It is easy to multiply illustrations of this kind, showing the 
grouping of relative numbers about a standard. By the test 
of experience, then, it is possible to apply the binomial law of 
frequency in many fields of vital statistics. It seems more 
difficult to apply the law to absolute numbers, calculating 
for instance, the number of births or deaths in a given year,
for the fluctuations of these numbers generally exceed the
limits of the binomial law. Still we can sometimes find the 
causes of variations and thereby calculate the absolute numbers 
more precisely. As regards the number of births it is neces- 
sary to make allowance for a tendency to a decrease of fertil- 
ity; the number of legitimate births depends on the number 
of married couples, the age of the parties, and the duration of 
their marriage. A calculation for Denmark through a period 
when there was no noticeable tendency to a decline of fertil- 
ity, taking into consideration only the fertility of married 
women by five-year age groups, in urban and rural districts, 
but without making allowances for duration of marriage or 
age of husbands, gave a favorable result, showing that it was 
not impossible to calculate with tolerable accuracy the num- 
ber of legitimate births in a particular year from the experi- 
ence of previous years. 

As to the deaths it seems more difficult to calculate the abso- 
lute numbers, because meteorological and economic causes 
exert an influence the numerical effect of which it is hardly 
possible to measure. By leaving out of consideration ages at 
which variations in climatic conditions cause great fluctuations 
in deaths (infancy), we may make some advance towards the 
binomial law. But there is no difficulty in getting several 
important results concerning relative numbers. The level of 
mortality may be very different from year to year, but we 
can perceive a tendency to the binomial law in the relative 
numbers, the death rates by age, sex, occupation, etc. By 
observing the death rate of two occupations in the same period 
and in the same country we may find that mortality in these 
professions is rising or sinking with the general level, and we 
are in a position to conclude from the figures which occupa- 
tion is more healthful. 

So, too, in the case of divorces. The absolute number 
varies greatly, as we have seen, but the distribution according 
to religious belief is rather regular. Knowing that 3 per
cent. of the dissolved marriages in Berlin are Catholic, whereas
about 5 per cent. of the weddings are Catholic, we are
warranted in inferring that Catholic marriages are less likely to
be terminated by divorce than others. 






In economic statistics it is more difficult to apply the bino- 
mial law, the conditions being as a rule much more complicated. 
But if we carefully single out homogeneous observations, we 
shall often find a decided tendency towards the binomial law. 
I may refer to interesting examples in H. L. Moore's work, 
Laws of Wages (1911). Another instance can be taken from 
the Copenhagen statistics of wages (1909). We have here 
the following distribution per 1,000:

Range of Yearly      Wage-earners     Range of Yearly      Wage-earners
Income in Crowns.    per 1,000.       Income in Crowns.    per 1,000.

Under 100                  7          Under 900                416
100-200                   24          900-1,000                 18
200-300                   37          1,000-1,200               48
300-400                   40          1,200-1,400              139
400-500                   52          1,400-1,600              167
500-600                   62          1,600-1,800              119
600-700                   70          1,800-2,000               77
700-800                   66          Above 2,000               16
800-900                   58

Under 900                416          Total                  1,000









This distribution has little resemblance to the binomial 
law: it has two maxima. But by dividing the observations 
into several classes we may discover a tendency towards the 
law. Thus, we can ask how each occupation is grouped around 
its special average, how many persons have, for instance, 
at least 100 or 200 crowns yearly more than this average, etc. 
If we then displace all these series of observations so that all 
the centers coincide, we shall find how all these working men 
and women together are grouped around their averages, and 
we may compare this to the exponential law (which closely 
resembles the binomial law). The result will be as follows: 



Deviation from          Frequency.              Frequency Expected
Average            Under the    Above the      According to
(Crowns).          Average.     Average.       Exponential Law.

0-100                 223          212               195
100-200               154          148               152
200-300                84           73                91
300-400                33           29                42
400-500                 6           19                15
500-600                 3            9                 4
600-700                 2            1                 1
700-800                 2            -                 -
800-900                 2            -                 -

Total                 509          491               500








On the whole it may be said that there is a tendency toward 
the exponential law; and this will hold true, if we deal separ- 
ately with males and females, with skilled and unskilled 
workers. If some few occupations show conspicuous differ- 
ences from the law, it always seems possible to find the reason. 
Thus, in gasworks there seem to be two distinct groups of 
laborers, each group with its own average. 
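
The displacement of the separate series to a common center can be sketched as follows. The wages below are invented for illustration and are not taken from the Copenhagen returns; the occupation names, the function names, and the class width of 100 crowns are likewise assumptions.

```python
from collections import Counter
from statistics import mean

# Hypothetical yearly incomes in crowns, NOT the Copenhagen data.
wages_by_occupation = {
    "bakers": [1350, 1420, 1500, 1580, 1650],
    "seamstresses": [520, 610, 650, 700, 770],
    "masons": [1500, 1640, 1700, 1760, 1900],
}


def pooled_deviations(groups):
    """Deviation of every worker from the average of his or her own occupation."""
    pooled = []
    for wages in groups.values():
        centre = mean(wages)
        pooled.extend(wage - centre for wage in wages)
    return pooled


def tabulate(deviations, width=100):
    """Count deviations by class of `width` crowns, under and above the average."""
    counts = Counter()
    for d in deviations:
        side = "under" if d < 0 else "above"
        lower = int(abs(d) // width) * width
        counts[(lower, side)] += 1
    return counts


deviations = pooled_deviations(wages_by_occupation)
for (lower, side), number in sorted(tabulate(deviations).items()):
    print(f"{lower}-{lower + 100} {side}: {number}")
```

The pooled table can then be set beside the frequencies expected under the exponential law, as in the comparison above.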

In anthropometry we have a good analogy to the theorems 
of the calculus of probabilities. The distribution of height 
among young men registered for military training is an excel- 
lent example. This analogy enables us to find interesting 
results with regard to typical differences between rural and 
urban population, rich and poor, etc. 

If no conformity to the binomial law can be found in a 
series of observations, the first task is the classification of the 
data. The most obvious classifications are by sex, age, pro- 
fession, residence, etc. 

If we do not at once reach the binomial law, we must try 
further subdivisions. Since each of these classes or subclasses 
has its peculiar conditions, and is under the influence of dif- 
ferent systems of causes, we can say that the more quickly we 
reach the point where the binomial law holds good, the fewer 
causes have been acting, whereas the greater the deviation 
from the binomial law, the greater the number of active causes. 
An instructive instance of this is found in the discussion of the 
balance of male and female births — one of the most interesting 
chapters of the history of statistics. According to Hofacker 
and Sadler, if proportionally many young men died — as in 
the present war — then nature would try to restore the balance, 
so that the proportionate number of male births would in- 
crease; couples where the wives were much younger than their 
husbands would have a relatively numerous male issue. The 
difficulty here is simply that we generally find very soon a 
conformity to the binomial law. Moreover, statisticians 
who dealt with this problem frequently misunderstood the 
results; in fact very often they had not observations enough, 
the differences they found being within the limits of the law of 
error. The late Dr. Geissler found two sets of causes acting 
against each other and thus creating a tendency to balance:
there are families with relatively many chances of getting
children of one sex only, but where this uniformity is once 
broken, then there is greater probability in future of getting 
children of the other sex. 

V. 

As far as I can see, the typical frequency curve in all vital, 
social, or economic statistics is always the binomial one; but 
it will require much investigation by statisticians to get to the 
bottom of this question, to prove whether this supposition 
is right, or under what conditions the observations will show 
a tendency to the binomial law. This third problem of the 
theory of statistics will require much patient labor in careful 
analysis of each separate class of material. If the observations 
show the validity of the binomial law, then we know that we 
can use the above described test of the mean error with little 
or no hesitation and that on the whole the calculus of proba- 
bilities is applicable. 

Many statisticians have tried to apply other frequency 
curves than those with which I have here dealt. We are in- 
debted especially to Karl Pearson and his disciples for im- 
portant investigations on this question, leading, for instance, 
to curves with "skew variations," etc. The binomial law is 
generally a little askew, but for all practical purposes we may 
set aside this small difference from the symmetric form. Very 
often, however, we meet with distributions of quite another 
form; how are we to deal with such phenomena? 

I may here insert that it is not as a rule necessary to know 
the mathematical formulas of these distributions. We may 
deal with them well enough, if we only know by experience 
the form of the curves. Distributions according to the bino- 
mial or the exponential law need no strict mathematical treat- 
ment, the observations from games will give us sufficient 
materials for finding the form of the curves rather accurately, 
even for more complicated data. A curious instance is pre- 
sented by the Danish statistics of fertility. If marriages of 
10-15 years' duration are classified according to social position, 
certain classes, represented by 3,759 families, are found where 
the birth rate is high, and others, represented by 2,082 families, 
where it has declined to a low figure. If the couples in both
groups are distributed according to size of family, after a
certain size of family is reached, we find that the numbers 
in both groups show a close correspondence. Out of 2,082 
families with low fertility, 192 had more than six children each, 
distributed in practically the same way as the corresponding 
753 couples in the group with a high birth rate. But these 
753 correspond to 3,006 couples in the group with the high 
birth rate with six children or less; in the group with a low 
birth rate we may separate out a corresponding number, 766 
couples with six children or less, — this number bearing the 
same proportion to 192 as 3,006 to 753. Assuming that these 
766 are distributed as in the larger group, we have the follow- 
ing figures: 



Number of        958 Couples with a   1,124 Couples with a   The Whole
Children.        High Birth Rate.     Low Birth Rate.        Group.

0                       94                  214                 308
1                       60                  180                 240
2                       99                  260                 359
3                      108                  187                 295
4                      128                  196                 324
5                      149                   55                 204
6                      128                   32                 160
Six or less            766                1,124               1,890

7                      102                    -                 102
8                       51                    -                  51
9                       25                    -                  25
10 and over             14                    -                  14
More than six          192                    -                 192

Total                  958                1,124               2,082







We have thus found that 958 couples seem not to have lim- 
ited the number of their offspring, whereas the remaining 
1,124 couples appear to have had modern ideas. This is of 
course only a preliminary investigation. In reality we should 
have split the group into several classes, each with its own 
birth statistics. At all events it appears from these numbers, 
that the movement to limit the birth rate had reached more 
than half of the group concerned. 
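
The separation of the 2,082 families can be retraced step by step. The sketch uses only figures printed above; the assignment of the first row of the table to childless couples and all variable names are assumptions made for illustration.

```python
# Counts printed in the text.
high_classes_over_six = 753      # high birth rate classes, more than six children
high_classes_six_or_less = 3006  # high birth rate classes, six children or less
low_classes_over_six = 192       # low birth rate classes, more than six children
low_classes_total = 2082

# Couples with six children or less who belong with the 192, in the
# proportion 3,006 : 753 found in the classes with a high birth rate.
scale = low_classes_over_six / high_classes_over_six
unrestricted_six_or_less = round(high_classes_six_or_less * scale)
unrestricted = unrestricted_six_or_less + low_classes_over_six
restricting = low_classes_total - unrestricted
print(unrestricted_six_or_less, unrestricted, restricting)

# Families with 0 to 6 children: whole group less the unrestricted part.
whole = [308, 240, 359, 295, 324, 204, 160]
unrestricted_by_size = [94, 60, 99, 108, 128, 149, 128]
restricting_by_size = [w - u for w, u in zip(whole, unrestricted_by_size)]
print(restricting_by_size, round(restricting / low_classes_total, 2))
```

The remainder, a little more than half of the 2,082 families, is the part ascribed to couples limiting the number of their children.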

It is apparent in a case like this that it is unnecessary to find 
a mathematical expression for the distribution of families in 
the two groups. Nor will it be advisable. On account of the 
great rapidity of the decline of the birth rate the distribution
of the groups will undergo considerable changes, which cannot
be foreseen in advance, at least as regards the group with a 
low birth rate. The curves which we should find would there- 
fore not be typical, and it would be of no use to make the 
calculations. 

But the same objection can be made to several other curves 
designed to express the distribution of observations in some 
field or other. One of the most interesting examples is that 
of the distribution of income among the inhabitants of a 
country. In trying to draw such a curve we shall certainly 
find a very skew distribution, the frequency curve differing 
immensely from the curve conforming to the binomial law. 
But if we try to divide society into different classes, we shall 
find in each one a separate distribution with its own center of 
gravity. In Danish income statistics the figures show a 
marked tendency for the separate groups to be distributed 
about an average in a way resembling the binomial distribu- 
tion; but, if all the groups are taken together, the curve for 
the whole population assumes a "pyramidal" income distri- 
bution (Pareto's law). It is not safe to stop here; a simple 
change in the constitution of the society, a labor conflict 
resulting in higher wages, for instance, may give the "pyra- 
mid" another form. This will be the case, too, if the commer- 
cial classes get higher profits from their business, or if the share 
of rent decreases or increases, etc.; all these causes are to a 
certain degree connected with one another, but so much de- 
pends on unknown factors that we can not well foresee the 
changes of the figures. Taking each class alone, we have a 
much better foundation for our investigation, and then we 
can return to the total population to get a view of the collect- 
tive results of all these peculiar movements. Using class 
statistics as a starting point, we are in fact much nearer the 
significant causes. If we seek a formula for the combined 
effects of all the causes in action, we run the risk of overlooking 
some, which it would really be exceedingly important to take 
into consideration. Much labor can indeed be lost in this way. 

The number of sick days after an accident affords another 
instance. As there are always proportionally very few cases 
of protracted illness after an accident, the result will again be
a very skew variation or a pyramidal distribution. But instead
of trying to give this distribution a mathematical expression
it would pay better to classify accidents in special
groups, each with its peculiar center of gravity.

The life table is another example. Here we have a com- 
bined expression necessary for practical use in life insurance, 
for calculating the present value of future assets and liabilities. 
But several mathematicians have erred in thinking that it 
would be possible to find a mathematical law of mortality, a 
physiological law, as it were. We have several formulas of 
this kind, by Lambert, Moser, Gompertz, Makeham. For a
certain period of life Makeham's formula is exceedingly
practical, but after all it is only a beautiful formula of inter- 
polation. It does not hold good at the end of life, for after 
the age of 80 mortality seems to vary according to another 
law, nor does it hold good in infancy. And from 20 to 80 
the variation is by no means always as uniform as the formula 
represents; a new treatment of tuberculosis may cause a dim- 
inution of the rates of mortality in a certain period of life, the 
entrance into married life improves the health, military serv- 
ice causes an increase of mortality, etc. Under certain cir- 
cumstances there will be even stronger influences, thus, for 
instance, the mortality of sailors, of persons with inherited 
diseases. A general mortality table will always be an aggregate
of numerous special tables, and, if the groups which form
the whole community undergo change, then the life table
itself will change, and it may prove impossible hereafter to 
apply the formula. For a close study of the causes at work 
it will be necessary first of all to take into consideration the 
mortality at each age and in each group. 

VI. 

Whether or not we use mathematical formulas for these 
combined expressions there are certain mathematical prob- 
lems in dealing with these complex averages which a statis- 
tician must be able to master. 

In trying to get comparable data it is often necessary to go 
through much preparatory labor. If the numerator and
denominator in the fraction measuring the frequency of an event
are not quite homogeneous, it may be necessary to make certain
interpolations. We may have to find by interpolation
from the census figures taken every tenth year the number of 
persons at a given age, who are exposed to the risk of death 
at a certain moment. On the whole we may here lay down 
the principle that this interpolation first of all should keep as 
near real life as possible, taking everything into consideration 
which might prove of importance, such as the fluctuations in 
emigration or immigration (as far as we can ascertain them), 
or of births and deaths. Complicated as these problems fre- 
quently are, it may be maintained that on the whole they have 
been solved with sufficiently close approximation. In fact, 
in many cases the different methods of interpolation lead to 
nearly identical results. This is true also with regard to ad- 
justment or graduation, a process closely related to interpola- 
tion. An experienced mathematician as a rule will get nearly 
the same results whether he adjusts the figures graphically, 
"mechanically," or by some mathematical formula, with or 
without application of the method of least squares. The 
possible progress in this field will be therefore chiefly that 
of refinement, smoothing the values to make the changes 
appear as regular and even as possible. But on the whole, 
the theory of statistics cannot expect much improvement from 
such efforts beyond what has already been achieved. 

The best means of treating the changes of a variable in a 
series of observations is the method used by Daniel Bernoulli 
in the eighteenth century in studying the influence of smallpox 
on mortality and further developed by Duvillard at the begin- 
ning of the nineteenth century. They used the fiction that all 
variations were continuous, so that it was possible to use the 
differential calculus instead of operating with finite differences. 
This method is of great value, as it simplifies many problems 
extremely; on the whole, it seems to have been generally 
adopted, though not always without hesitation. 

According to this method we calculate, for instance, the
force of mortality, a quantity measuring the risk of dying in
the next infinitely short moment. This quantity being $\mu_x$,
where x is the age, we have as the number of deaths during a
moment of length $dx$, with $l_x$ standing for the number exposed
to risk, $\mu_x l_x \, dx$. If $l_x$ signifies the number surviving at age x
according to a life table, then the number surviving at age
$x + dx$ will be $l_x + \frac{dl_x}{dx} dx$. The number of deaths during the
moment $dx$ being also $\mu_x l_x \, dx$, we have $\mu_x l_x = -\frac{dl_x}{dx}$. If we are
warranted in making certain simple suppositions with regard
to the force of mortality, this equation will lead to simple ex- 
pressions for l x . But not only can we use the force of mor- 
tality as it stands, we can also split it into its elements, finding, 
for instance, the rate of mortality from a single cause of death, 
or a group of causes, and further asking what will be the result 
if one or more causes disappear. Or we may introduce another 
quantity, the rate of marriage, corresponding to the force of 
mortality, and ask how many persons will survive as single, 
how many will marry within a certain interval, how many 
married people will die in the same period, etc. It will cause 
no difficulty, if the force of mortality is supposed to be different 
among bachelors and married persons. The same methods 
can be applied to other problems, for instance, finding the 
number of persons who become invalids or who recover or who 
die after recovery. Here, too, we can use different values for 
the rate of mortality and allow for the influence of age and the 
duration of invalidity. Again, we can determine the prob- 
ability of a person being sentenced for a crime, once, twice, 
etc. The problems are everywhere of the same kind and they 
present no particular theoretical difficulty. In fact, the main 
difficulty at present seems to consist in providing sufficiently 
correct and complete observations rather than in dealing with 
the materials afterwards. 
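
The relation between the force of mortality and the number surviving can be illustrated by a small numerical sketch. It assumes, purely for illustration, a force of mortality that is constant over the interval considered and made up of two hypothetical causes of death; the rates, the initial number, and the function name `survivors` are assumptions and are not taken from any life table.

```python
import math

def survivors(l0, mu, years):
    """Survivors after `years` when the force of mortality `mu` is constant.

    With constant mu, the equation mu * l = -dl/dx has the solution
    l(x) = l0 * exp(-mu * x).
    """
    return l0 * math.exp(-mu * years)

# Hypothetical forces of mortality (per person per year) from two causes.
mu_cause_a = 0.004
mu_cause_b = 0.006
mu_total = mu_cause_a + mu_cause_b  # the forces from separate causes add

l0 = 100_000
alive_all_causes = survivors(l0, mu_total, 10)
alive_without_a = survivors(l0, mu_cause_b, 10)  # cause A supposed to disappear

print(round(alive_all_causes), round(alive_without_a))
```

The second figure answers, under these assumptions, the question of what the result would be if one cause of death disappeared while the other remained unchanged.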



VII. 

Where are we at present with all these problems? We 
might describe the program of vital statistics as the task of 
following human beings from the cradle to the grave, taking 
account of all their chances, their good fortune and mishaps, 
their physical and intellectual growth, their criminality and 
moral defects, etc., etc. What will be a young man's choice
of profession? Will he follow in the steps of his father or not? 
What will be his chance of getting a better position in society 
or the risk of his being forced downwards in the struggle? 
What will be the danger of unemployment, and, after having 
lost a position, what is the chance of finding work again? 
What income will a man on an average earn in each occupation 
and at different ages, and what will be his economic position 
on reaching old age? Again, will education pay in the com- 
petition of life, will a boy who begins as apprentice in some 
trade get a substantial pecuniary advantage, or will he be 
actually no better off than an unskilled laborer? 

It will be readily seen that there is still a long way to go. 
Thus, as to birth statistics, we shall have to add considerably 
to the present store of statistical data on fertility in different 
classes of society as influenced by the duration of marriage 
and the age of consorts. In marriage statistics we shall have 
to observe the different rate in different classes with allowance 
for the fact that very often a wedding is registered for two 
places of residence — that of the bride and of the bridegroom — 
and that the wedding frequently is followed by a change of 
occupation. 

As to mortality statistics there used to be an enormous 
confusion. In the middle of the past century medical statistics 
suffered severely from dilettanteism, especially as regards the 
correlation between mortality and occupation. Much was 
accomplished in the way of clearing up the subject in England 
by the efforts of William Farr and his successors, but there is 
still a serious lack of material. The same may be said of 
statistics of sickness and invalidity, though observations dur- 
ing a century of experience in Friendly Societies are available 
and German compulsory insurance against invalidity, with its 
complicated machinery, has been in force for 25 years. Other 
problems have hardly been touched as yet, thus the mortality 
of persons suffering from certain diseases, such as syphilis, 
mental diseases, etc., the mortality of persons rejected by life 
insurance companies, the influence of heredity, etc. 

Still less cultivated is the field of migration statistics. We 
have some data with regard to transatlantic migration, but not 
much more. Still it would not be impossible to find approximately
the rate of migration for different ages (corresponding
to rate of mortality). If we know from two consecutive censuses
the numbers of persons born in a certain year, and if
we know further the rate of mortality, we can find approximately
the rate of migration (excess of emigration above immigration);
and knowing in some places the numbers of those who
immigrated a short time before the census, we have some data 
by which we can separate this rate of migration into its ele- 
ments. 
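
The calculation just described may be sketched as follows. The census counts and the probability of surviving from one census to the next are hypothetical figures, the function name is an assumption, and the sketch ignores refinements such as deaths among the migrants themselves.

```python
def net_migration_rate(count_first_census, count_second_census, survival_probability):
    """Net emigration (emigration minus immigration) per person counted
    at the first census, for persons born in one and the same year."""
    expected_survivors = count_first_census * survival_probability
    net_emigrants = expected_survivors - count_second_census
    return net_emigrants / count_first_census

# Hypothetical cohort: 50,000 at the first census, 46,000 at the second,
# and a probability of 0.95 of surviving the interval between them.
rate = net_migration_rate(50_000, 46_000, 0.95)
print(rate)
```

A negative result would indicate an excess of immigration over emigration.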

Still more complicated are the problems of economic statis- 
tics, for here we have frequently one more dimension. We 
have to deal not only with numbers or quantities, but also 
with their value in money. Only a very small part of the 
work has been done, and the scanty materials we have are 
frequently of doubtful value, because it has rarely been pos- 
sible to separate them in such a way that the numbers are 
quite homogeneous. This is the main reason why statistics 
have so often been considered unreliable. 

In agricultural and industrial statistics we have above all 
the problem of finding the quantity produced and its value in 
money; as mentioned above, representative investigations 
in this field are particularly useful or even indispensable. 
Other important problems to be investigated are the distri- 
bution of enterprises by size of farm or industrial establishment, 
and the working of the law of increasing returns in industry. 
Statistics of international commerce are somewhat developed, 
though there are evident defects in the returns, but of internal 
commerce we know practically nothing. The variations of 
price under the influence of supply and demand form another 
series of important statistical problems. Again what is the 
rapidity of circulation of a coin or a bank-note as it passes 
from hand to hand, and how can we get observations on the 
influence of bank reserves or of the rapidity of circulation on 
prices and the rate of interest? As a link between vital and 
economic statistics we have, moreover, the question as to the 
yearly income of a whole nation or a class of population, the 
statistics of public finances, shifting and incidence of taxation, 
etc. Even here only very few of the important problems 
have been thoroughly dealt with. 



35] Scope and Method of Statistics. 259 

VIII. 

It appears from this sketch of future problems, the list of 
which any economist might easily extend, that statisticians of 
the present day have no lack of work. But frequently we
observe among statisticians a certain despair of their ability,
not in the collection but in the interpretation of statistical
data. In consequence of this pessimism there are statisticians
who confine themselves to giving a true description of the 
phenomena. But we must not forget that a true description 
cannot be given if we cannot find the relations of cause and 
effect that prevail, for else we run the risk of giving a confused 
mass of details instead of a clear outline of the main features
of the observations concerned. 

It must not be forgotten, however, that the task of the 
statistician is not so much to find the causality himself as to 
help others to find it. The statistician must be content if he 
can show that certain groups of numbers show marked differ- 
ences, leaving it to physiology, meteorology, and other sciences 
to explain these differences. Thus his task is indeed a very 
modest one, but it is always satisfying to be able to put others 
on the right track or to spare them from some misapprehension 
of the facts. 

Often enough the statisticians will be obliged to acknowl- 
edge that there are two or more causes which cannot be sepa- 
rated; it may even be that two causes have a much greater 
effect if combined than if working separately. So with regard 
to labor and capital. If a man works without any capital, 
his produce is quite insignificant; and, if capital is left alone, 
its product will be zero; no one can tell therefore which part 
of the produce is to be ascribed to labor and which to capital. 
A simple example will show how easily confusion arises. 
Suppose that certain goods to the value of $10,000,000 are 
exported. If now the exported quantity increases 50 per cent., 
and prices 20 per cent., then the value of the export will be 
increased to $18,000,000. If prices were unaltered, but the 
quantity only changed, then there would be an increase of 
value of $5,000,000; if prices only varied, we should have 
$2,000,000 more; thus we could account altogether for $7,000,000;
the real addition was $8,000,000, or $1,000,000 more.
In this case it is easy enough to see what the matter is, but 
very often the problems are much more complicated. An 
example from anthropometry will illustrate this. 
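
Before turning to that example, the arithmetic of the export case may be set out explicitly. The sketch uses the figures given in the text; the variable names are assumptions chosen for illustration.

```python
base_value = 10_000_000   # value of the exports at the outset, in dollars
quantity_growth = 0.50    # quantity increases 50 per cent.
price_growth = 0.20       # prices increase 20 per cent.

new_value = base_value * (1 + quantity_growth) * (1 + price_growth)
total_increase = new_value - base_value

quantity_alone = base_value * quantity_growth               # prices unaltered
price_alone = base_value * price_growth                     # quantity unaltered
joint_effect = base_value * quantity_growth * price_growth  # belongs to neither alone

print(round(new_value), round(quantity_alone), round(price_alone), round(joint_effect))
```

The last term is the part of the increase that arises only because both causes act together, and that can be ascribed to neither of them separately.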

Let us ask whether height is an inherited characteristic. 
To make the case as simple as possible, let us suppose that 
there are in a country two different types, each with its pecu- 
liar average, round which the individuals are grouped. A 
person belonging to the tall type may by chance be of small 
stature, and vice versa; and, if we divide the population into 
two groups according to height, we shall have a smaller or 
larger number of the short type in the group of tall men and 
vice versa. Supposing now that the type is inherited, whereas 
the casual causes of deviations from the typical height are not 
transmitted to the offspring, we have in the second generation 
two groups, each numbering persons of both types, but prob- 
ably with an unequal distribution. Each type having its own 
center of gravity, the observed average in each group will lie 
between them, that is to say, the second generation will 
apparently show a smaller distance between the two sizes 
than was the case with the fathers. Galton and his disciples 
would speak here of regression, human beings having appar- 
ently a tendency to revert to former types, but it follows from 
the above that this regression is only a formal effect of the 
nature of the observations; it does not really lead us to an 
understanding of the causality. 

IX. 

Let us now suppose that we have a group of observations 
in which only one rate of frequency may be said to prevail; 
then, as shown above, we have the means of controlling our 
conclusions by finding the limits of deviation from the stand- 
ard value. This is the most simple of the statistical problems. 
But generally the data are not homogeneous. Most frequently 
the material must be split up into several groups, each with 
its own frequency, and we have then the problem of getting 
an idea of all these relations together. 

We can sometimes confine ourselves to studying each of the 
groups separately in comparison to other corresponding groups 
of observations. The two series may stand so clear of each 



37] 



Scope and Method of Statistics. 



261 



other that no doubt is possible. When in a certain occupation 
the mortality in each age group is decidedly greater than in the 
general population, it is clear that the occupation concerned 
is under the influence of some unfavorable cause. But fre- 
quently the differences are not all in the same direction, some- 
times some of the rates in the occupation are lower than in the 
general population, and the evidence thus is apparently con- 
tradictory. This irregularity may perhaps be due to acci- 
dental causes, the influence of which would have been elim- 
inated, had the number of observations been larger. It will 
therefore be necessary to combine these results in order to get 
a clear idea of the influence prevailing in the occupation. 
Here we have many methods of procedure. In determining 
which method is the best, it is a good principle always to keep 
as near as possible to the original observations, never leaving 
them out of sight. Taking, in the present case, the mortality 
of the general population as a standard, we may calculate 
some expression or other from the two series of observations; 
the mean duration of life, the value of a life annuity, etc. But, 
according to my experience, it is always best to choose a 
method which permits us to keep in view the original numbers. 
In this way it will be easier to avoid misapprehension and 
erroneous conclusions. 

The simplest standard calculation in the present case is to
find the budget of deaths in the occupation, that is, the deaths
that would have occurred in the occupation if the mortality had
been that of the general population or of some other group which
for some reason we might consider typical. To illustrate this,
I shall begin with a very simple case, that of barristers and
solicitors according to English vital statistics (1900-1902).
The following table gives all the necessary details:





            General
            Population.   Barristers and Solicitors.

Age         Death Rate                 Actual    Expected   Death Rate
Period.     per 1,000.    Population.  Deaths.   Deaths.    per 1,000.

25-34          6.38         16,323        77       104         4.72
35-44         10.94         19,686       142       215         7.21
45-54         18.67         13,869       189       259        13.63
55-64         34.80          6,741       188       235        27.89

25-64         14.08         56,619       596       813        10.53

The simplest procedure would be to compare the rate of 
mortality for barristers and solicitors at all ages (10.53) with 
that of the general population (14.08), but as the age distribu- 
tion is not quite the same, this comparison will not be without 
objection. The rates for each age, however, show very much 
the same proportion to the corresponding general rates, and 
we might stop with that; but, if we want a common expression, 
the budget of deaths will give a practical means of comparison. 
Under the supposition that the mortality is the same as in the 
general population, we find in the youngest group 104 ex- 
pected against 77 actual deaths, in the next age group we have 
215 against 142, etc., altogether 813 against 596, the actual 
mortality being thus only 73 per cent of the expected one. 
The mean error is 29, or about 1/7 of the deviation (217); the
mean error in the single age groups is between 1/3 and 1/5 of
the corresponding deviations. The mean error is then pro- 
portionally much smaller, if we take the total, than if each 
group is taken individually. This is a great advantage, for 
often there are so many age groups (for instance, yearly inter- 
vals) that the deviations from the type are very irregular, 
whereas the total may allow of quite a safe deduction. The 
separate bits of evidence may be rather uncertain, but if com- 
bined they may warrant a definite conclusion. 
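
The budget of deaths for the barristers and solicitors can be recomputed in a few lines. The populations 19,686 and 6,741 used here are the values implied by the printed deaths and death rates; taking the mean error as the square root of the expected number of deaths is an assumption of this sketch, since the text does not state how the figure of 29 was obtained.

```python
import math

general_rate_per_1000 = [6.38, 10.94, 18.67, 34.80]  # ages 25-34 ... 55-64
population = [16_323, 19_686, 13_869, 6_741]         # barristers and solicitors
actual_deaths = [77, 142, 189, 188]

# Deaths that would have occurred at the rates of the general population.
expected_deaths = [rate * n / 1000
                   for rate, n in zip(general_rate_per_1000, population)]

total_actual = sum(actual_deaths)
total_expected = sum(expected_deaths)
ratio = total_actual / total_expected

# Assumed approximation: mean error = square root of the expected deaths.
mean_error = math.sqrt(total_expected)

print([round(e) for e in expected_deaths], round(total_expected))
print(round(100 * ratio), round(mean_error))
```

Only the age distribution of the living enters the calculation, which is why the method remains applicable when the ages at death are unknown.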

The calculation of expected deaths will be particularly 
advisable in case the age distribution of the deaths is unknown. 
As remarked above, we cannot draw sound conclusions from 
the crude rate of mortality, but if the expected mortality is 
calculated, we have in the actual and the expected numbers, 
two homogeneous quantities, which we can safely compare. 
In a case like this it may be objected that mortality in one 
age group very well might be greater than the expected; we 
cannot tell whether mortality in the age group 25-34 actually 
has been lower or higher than in the general population. But 
at any rate it is clear that in some age group or other mor- 
tality must have been decidedly lower among barristers than 
in the general population; we have thus found a valuable 
result, though it may be worth while to subject it to a closer 
investigation. But ordinarily one might reasonably assume 
that the mortality in all age groups is lower, as is actually the 
case here. 






English statisticians often use a modification of the method 
just described of calculating expected deaths; viz., the method 
of "standards" (in fact, the method of expected deaths can 
quite as well claim the name of a "standard" method). The 
difference between the two forms will be easily understood by 
using the same example as above. First, let us calculate the 
number of persons among whom, according to the general 
table of mortality and with the age distribution of the general 
population, 100 deaths would occur. Let us further calculate 
the number of deaths which would take place in this "stand- 
ard" population, if the mortality were the same as observed 
among barristers and solicitors. The figures are as follows: 





                           Mortality per 1,000.         Expected Deaths.

Age         Standard       General       Barristers and   General       Barristers and
Period.     Population.    Population.   Solicitors.      Population.   Solicitors.

25-34         2,625           6.38          4.72            16.75         12.39
35-44         2,040          10.94          7.21            22.32         14.71
45-54         1,475          18.67         13.63            27.54         20.10
55-64           959          34.80         27.89            33.37         26.75

25-64         7,099          14.08         10.53           100.00         73.95



According to mortality experience among barristers and 
solicitors, in a standard population approximately 74 would 
die instead of 100 according to the general rates of mortality. 
This is very nearly the same proportion as reached above. 
In the present case the two forms of comparison lead to nearly 
the same result, and this will generally be the case, if the age 
distribution in the special group is not much different from 
that of the general population. But on the whole the method 
described last is a little more complicated than the calculation 
of expected deaths, and in particular it is not applicable, if the 
age distribution of the deaths of the barristers and solicitors 
is unknown. 
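
The method of "standards" can be sketched in the same way, using the figures of the table; the variable names are assumptions. The death rates of each group are applied in turn to the same standard population.

```python
standard_population = [2_625, 2_040, 1_475, 959]   # ages 25-34 ... 55-64
general_rate_per_1000 = [6.38, 10.94, 18.67, 34.80]
barrister_rate_per_1000 = [4.72, 7.21, 13.63, 27.89]

def deaths_in_standard(rates_per_1000):
    """Deaths in the standard population at the given age-specific rates."""
    return sum(n * r / 1000 for n, r in zip(standard_population, rates_per_1000))

deaths_general = deaths_in_standard(general_rate_per_1000)
deaths_barristers = deaths_in_standard(barrister_rate_per_1000)

print(round(deaths_general, 2), round(deaths_barristers, 2))
```

Here the age-specific death rates of the special group are required, and these cannot be formed unless the deaths are known by age.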

Still more complicated are other methods of procedure, such 
as the calculation of the mean duration of life or of the value 
of life annuities. In the present case we have the mortality 
experience for only a part of human life, and for ages above 
65 we should be obliged to make estimates. Another way of 
comparison would be to calculate the number surviving at 65
out of a certain initial population. Such calculations may 
often throw light on the effect of the peculiar circumstances 
of the group under investigation; they ought never to be relied 
upon as the sole method of comparison, but may be used as 
supplementary to the simpler method of expected deaths. 

X. 

In economic statistics it will frequently prove of interest to 
have a common measure, a "standard," in order to get a clear 
view of the main result of the changes that may have occurred. 
Thus, when comparing the mercantile marine of two countries, 
or of the same country at two different periods, it may be 
useful to assign different weights to steamers and to sailing 
vessels, multiplying the tonnage of the steamers, for instance, 
by 3, thus taking into consideration the greater carrying power 
of steamers. Or in considering the livestock of a country, a 
cow being counted as 1, we may give a horse the weight of, 
say, 3/2, a pig as 1/4, a sheep as 1 / e , etc., and thus secure a rough
estimate of the collective value of the livestock. A calculation 
of this kind may prove useful, but it cannot replace a close 
study of the original figures. The calculation may conceal 
changes which it would be of the highest interest to know 
thoroughly. Thus, in the history of agriculture, a decrease 
in the number of sheep might be offset by an increase of swine, 
and from such a computation we should learn nothing of this 
change. But having first studied the numbers in detail, we 
may get a comprehensive view by using some standard cal- 
culation. In any case the direct knowledge will always have 
a greater value than any abstraction, however ingeniously 
invented; and here, too, it may be laid down as a first principle 
always to keep the original numbers in sight as far as possible. 

A most interesting example of a standard calculation is the 
index number, the familiar instrument of the economists. 
In trying to find the purchasing power of money at two differ- 
ent epochs we may calculate the budget of a certain model 
family, for instance that of a working man with a certain 
yearly income. If this budget at one time is $500 and the 
same quantities of all the commodities concerned at a later 
time would cost, for instance, $550, then the purchasing power
of money for this working man may be said to have decreased
in the ratio 10/11. But, if we used other budgets, we might
get other results, so the comparison presents a defect, though 
probably the difference would generally be small. We are a 
little further from reality, if we calculate average values of 
the percentage variations of prices of different commodities, 
as is the case with most modern calculations of index numbers. 
In fact, we reduce the problem to that of using a budget in 
which each commodity enters with unity — or perhaps with a 
certain weight — a sort of abstraction from the actual budgets 
used above. It may easily be proved that a calculation of 
this kind cannot possibly give mathematically exact results. 
Let prices of two commodities at a given time be equal to 
unity, let the two prices at a later time be 0.80 and 1.25, and 
again at a third moment of time unity as before. Now taking 
our standpoint at the first moment, we find that the general 
index has increased at the second moment by $\tfrac{1}{2}(0.25 - 0.20)$,
or $2\tfrac{1}{2}$ per cent. But from this moment till the third we have
also an increase equal to $\tfrac{1}{2}(0.25 - 0.20)$ (the price of one commodity
increasing from 0.80 to 1.00, or 25 per cent., and that of the
other falling from 1.25 to 1.00, or 20 per cent.). Thus in
each of the two intervals prices have increased on an average
$2\tfrac{1}{2}$ per cent. But the result of the combined movement from
the first moment to the third will be nil, there has in fact been 
no change whatever. In practice this objection is not of great 
significance. Most economists would decidedly prefer the 
simple calculation of general index numbers by taking the 
arithmetic average of price changes instead of using a geo- 
metric average, such as recommended by the late W. Stanley 
Jevons, whose formula required operations with logarithms 
instead of with the numbers themselves. But at all events, 
whether we use one or the other formula, it will be necessary 
always to keep the various prices themselves in view. In the 
above example one commodity was becoming cheaper, the 
other, more expensive, and these facts will require close study 
before we try to find collective expressions for the changes. 
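
The example of the two commodities can be verified numerically. The sketch compares the arithmetic average of price changes with a geometric average of the kind recommended by Jevons; the function names are assumptions chosen for illustration.

```python
import math

def arithmetic_index(old_prices, new_prices):
    """Unweighted arithmetic average of the price relatives new/old."""
    relatives = [new / old for old, new in zip(old_prices, new_prices)]
    return sum(relatives) / len(relatives)

def geometric_index(old_prices, new_prices):
    """Unweighted geometric average of the price relatives, via logarithms."""
    logs = [math.log(new / old) for old, new in zip(old_prices, new_prices)]
    return math.exp(sum(logs) / len(logs))

first = [1.00, 1.00]
second = [0.80, 1.25]
third = [1.00, 1.00]

rise_1_to_2 = arithmetic_index(first, second)   # an average rise in this step
rise_2_to_3 = arithmetic_index(second, third)   # an average rise again
direct_1_to_3 = arithmetic_index(first, third)  # no change at all

geometric_chain = geometric_index(first, second) * geometric_index(second, third)

print(rise_1_to_2, rise_2_to_3, direct_1_to_3, geometric_chain)
```

With the arithmetic average each interval shows a rise although the prices end where they began; with the geometric average the two steps cancel.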

Very often the characters we want to compare are not 
quantitatively measurable. If we investigate eye color, we 
may arrange the colors in a scale from blue to brown, and give 
the first rubric of the scale the weight one, the next two, etc.,
and calculate in this way the "average eye color" of particular 
groups. Thus, taking an instance from Galton's Family 
Records, we get the following results with regard to fathers
and sons: 





Fathers Grouped According
to Eye Color.                   Average Number
Group.         Number.          for Sons.

1                 36               2.7
2                322               3.1
3                264               3.4
4                180               4.0
5                  5               4.2
6                 64               4.8
7                101               5.0
8                 28               5.6

Total          1,000               3.7







It will be seen from this "contingency" table that the eye 
color of the sons corresponds to that of the fathers, in that 
the eye color of the sons shows a constant progression from 
blue to brown in the same order as that of the fathers, though 
there is a certain "regression" observable, the dispersion of the 
sons being much smaller than that of the fathers. Of course 
there is something unreal in such a comparison, the characters 
not being quantitatively measurable, and it would be better 
either to consider only the original data or to modify the cal- 
culation so as to conform more clearly to the observations. 
Professor Pearson proposes a distribution according to the 
exponential law, assigning to each group a mean distance from 
the average corresponding to this law. By this method the 
following distribution is obtained:

Group.      Fathers.      Sons.

1             1.0          3.8
2             3.3          4.3
3             4.8          4.8
4             5.8          5.2
5             6.3          5.6
6             6.5          5.7
7             7.0          5.8
8             8.0          6.4







It will be seen that the numbers on the whole have the 
same character as before; here, too, there is a smaller dispersion
among the sons than might have been expected.
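
The first of the two scorings, in which the eye-color groups simply receive the weights one, two, etc., can be checked from the figures of the first table. The sketch below computes the general average for the sons and compares the spread of the fathers' scores with the spread of the sons' group averages; the helper names are assumptions, and the second spread measures only the averages of the groups, not the individual sons.

```python
import math

father_scores = [1, 2, 3, 4, 5, 6, 7, 8]          # eye-color groups of the fathers
cases = [36, 322, 264, 180, 5, 64, 101, 28]       # number of cases in each group
son_averages = [2.7, 3.1, 3.4, 4.0, 4.2, 4.8, 5.0, 5.6]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def weighted_sd(values, weights):
    m = weighted_mean(values, weights)
    variance = sum(w * (v - m) ** 2 for v, w in zip(values, weights)) / sum(weights)
    return math.sqrt(variance)

mean_sons = weighted_mean(son_averages, cases)
sd_fathers = weighted_sd(father_scores, cases)
sd_son_groups = weighted_sd(son_averages, cases)  # spread of the group averages only

print(round(mean_sons, 1), sd_fathers > sd_son_groups)
```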




Such a method may be useful in dealing with certain prob- 
lems, as, for instance, the results of examinations with peculiar 
marks for each discipline. At all events a detailed study of 
the reports themselves will always be indispensable as a check 
upon these calculations; in fact, low marks in one study, com- 
bined with very high marks in another, will give a different 
impression from that of a uniform grade in all studies. 

XI. 

One of the methods of comparison which have of late become 
popular among statisticians with mathematical training is 
that of correlation based on Bravais's formula. A simple 
example is offered by the age distribution of brides and grooms. 
If on a piece of plotting paper we put a point for each couple, 
the age of the bride being the abscissa, that of the bridegroom 
the ordinate, then we shall find that the points are not spread 
uniformly over the paper, but that they are clustered near a 
certain line, forming a sort of milky way. We may then by 
the method of least squares find the equation of the straight 
line which may be said to give the best idea of this milky way. 
This equation being $y - kx = 0$, we have $\sum (y - kx)^2$ = a minimum,
which will require $k \sum x^2 = \sum yx$. If we further introduce
the expression $\sum x^2 = n s_x^2$, $n$ being the number of cases, further
$\sum y^2 = n s_y^2$ and $\sum xy = n s_{xy} = n r s_x s_y$, where $r$ is a quantity
defined by the equation, we find the equation of the line in
question:

$$y = kx = r \frac{s_y}{s_x} x$$

and the sum of the squares of the differences is:

$$\sum (y - kx)^2 = n \mu^2,$$

so that we have

$$\mu^2 = s_y^2 (1 - r^2).$$

If r = 1, all the points will lie on the straight line; the smaller
r is, the greater will be the mean error; when there is no corre- 
lation, we shall have r = 0. This quantity r is called the co- 
efficient of correlation, whereas the quantity k is called the 
coefficient of regression. 
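
The formulas above can be tried on a small set of figures. The ages below are hypothetical, and the sketch assumes that both variables are measured from their averages, as the form of the line $y - kx = 0$ implies; the function name is likewise an assumption.

```python
import math

def correlation_and_regression(xs, ys):
    """Return r, k and the squared mean error, with both variables
    measured as deviations from their own averages."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    s_x = math.sqrt(sum(d * d for d in dx) / n)
    s_y = math.sqrt(sum(d * d for d in dy) / n)
    r = sum(a * b for a, b in zip(dx, dy)) / (n * s_x * s_y)
    k = r * s_y / s_x                       # coefficient of regression
    mu_squared = sum((b - k * a) ** 2 for a, b in zip(dx, dy)) / n
    return r, k, s_y, mu_squared

# Hypothetical ages of six couples (bride, bridegroom).
brides = [20, 22, 24, 26, 30, 35]
grooms = [23, 24, 28, 27, 33, 40]

r, k, s_y, mu_squared = correlation_and_regression(brides, grooms)
print(round(r, 3), round(k, 3), round(mu_squared, 3))
```

The squared mean error computed directly from the deviations agrees with $s_y^2 (1 - r^2)$, as the derivation requires.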






The formula of correlation will prove useful in all cases where 
the points are grouped nearly around a straight line, as is the 
supposition, and this will very often hold good, as in the case 
supposed of the age distribution of brides and grooms. Still 
we must not forget that this formula removes us somewhat 
from the original data and that it does not relieve us from the 
necessity of making a close investigation of these observations. 
On the whole, the formula of correlation does not introduce any 
new principle; by tabulating and grouping the observations we 
can easily establish as a rule the fact of correlation without the 
use of the formula. To take an example from Yule's Intro- 
duction to the Theory of Statistics (1911), the percentage of 
population in receipt of poor law relief in 38 English Poor Law 
Unions of an agricultural type is correlated with the average 
weekly earnings of agricultural laborers, and we find as the 
coefficient of correlation, r = -0.66. But it is unnecessary
to make this calculation. Grouping the districts according 
to percentage of poor law relief, we find the following numbers, 
which tell us, without any long computation, how relief and 
wages are related:

Number of        Per Cent Poor      Average Wages
Districts.       Law Relief.        (Shillings).

 9                  1.90               17.88
10                  3.21               16.21
10                  4.30               15.00
 9                  5.27               14.75

38                  3.67               15.95







These numbers give us a perfectly clear idea of the connec- 
tion between wages and poor law relief. The coefficient of 
correlation will tell us nothing which cannot be seen from an 
inspection of the original numbers. But in the illustration the 
amount of poor relief is influenced not solely by the average 
wages, but also by other influences, for in each group there are 
conspicuous deviations, and to explain them other causes must 
be found. The coefficient of correlation teaches us this and 
no more. 
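
The grouped figures can be examined with nothing more than weighted averages. The sketch below uses the numbers of the table; the variable names are assumptions. It recomputes the general averages and confirms that wages fall steadily as relief rises from group to group.

```python
# (number of districts, per cent poor law relief, average wages in shillings)
groups = [
    (9, 1.90, 17.88),
    (10, 3.21, 16.21),
    (10, 4.30, 15.00),
    (9, 5.27, 14.75),
]

districts = sum(n for n, _, _ in groups)
mean_relief = sum(n * relief for n, relief, _ in groups) / districts
mean_wages = sum(n * wages for n, _, wages in groups) / districts

wages_by_group = [wages for _, _, wages in groups]
wages_fall_steadily = all(a > b for a, b in zip(wages_by_group, wages_by_group[1:]))

print(districts, round(mean_relief, 2), round(mean_wages, 2), wages_fall_steadily)
```

The general average of wages obtained from the rounded group averages differs from the printed 15.95 only in the last decimal.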

To take another example from Mr. Yule's book, what is the 
correlation between the fertility of mother and the fertility of 



45] 



Scope and Method of Statistics. 



269 



daughters in the British peerage? In each marriage of the 
mother generation one married daughter was chosen, and the 
number of her children put down, only marriages with a dura- 
tion of at least 15 years being taken. It will be seen that the 
numbers are not quite homogeneous, in the first generation 
each marriage taken into consideration being fertile and with 
at least one daughter, in the second generation all marriages 
being counted. 
We have the following numbers: 



Mothers.         Daughters.

Number of        Average                                    Actual Number
Children.        Number of      Actual       Expected       per 100 Expected
                 Children.      Number.      Number.        Children.

1                  3.2             167          230              73
2                  3.5             201          247              81
3-4                3.9             905        1,006              90
5-6                4.1           1,087        1,145              95
7-8                4.8             976          889             110
9-10               5.1             655          555             118
Over 10            5.6             344          263             131

Total                            4,335        4,335









It will easily be seen how these calculations have been made; 
they are in fact quite elementary; the mean error in each group 
can be found without difficulty. In this method of calcula- 
tion we have the advantage of being in close contact with the 
original observations. 
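
The last column of the table follows from the two columns before it, as a short sketch shows. The figures are those of the table; the variable names are assumptions.

```python
actual_children = [167, 201, 905, 1_087, 976, 655, 344]
expected_children = [230, 247, 1_006, 1_145, 889, 555, 263]

# Actual number of children per 100 expected, group by group.
per_100_expected = [round(100 * a / e)
                    for a, e in zip(actual_children, expected_children)]

print(per_100_expected, sum(actual_children), sum(expected_children))
```

The two totals agree, as they must if the expected numbers are formed by spreading the observed total over the groups at one common rate.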

Most of the problems which we meet in statistics of families 
are of the same elementary nature, and we shall gain very 
little in dealing with them by the more complicated processes. 
We shall always have to divide the observations in order to 
find the frequency in question in each group, and, if these 
groups are too small, we may combine the results by the method 
of expected cases or by some other simple method of calcula- 
tion. The dependence of mortality of children on the age of 
the mother or the number of the birth, etc., or the rate of mor- 
tality in "phthisical" families is an example. The main 
difficulty is perhaps to find the moment in which the persons 
concerned enter into observation, for instance, when a death 
from phthisis occurs in the family. The following example is
taken from Lundborg, Medizinisch-biologische Familienfor- 
schungen (1913), giving details with regard to a certain epi- 
leptical disease (myoklonus) found in nine families with alto- 
gether 74 children. In this case we can take the family under 
observation after the birth of the first child suffering from the 
disease. The method will be seen from the table, in which 20 
children who died under the age of ten have been left out of 
consideration. 



Number of        Number of Children      Number
Family.          in Family.              Diseased.

I                     6                     3
II                    8                     1
III                   6                     2
IV                    9                     3
V                     9                     1
VI                    5                     2
VII                   6                     2
VIII                  4                     2
IX                    1                     1

Total                54                    17

[Column "Distribution of Children (the Affected Ones Signified by x)," showing the affected children of each family by order of birth, first to ninth: not recoverable.]





In the first family the oldest child suffered from the disease; 
there were five children under observation, two of them being 
affected; in the next family the second child was diseased, but 
of the six children born subsequently none was affected, etc. 
We obtain the following result: 



Number of Family.    Number of Children.    Cases of Myoklonus.

I                            5                       2
II                           6                       0
III                          4                       1
IV                           5                       2
V                            2                       0
VI                           3                       1
VII                          3                       1
VIII                         3                       1

Total                       31                       8



Thus about one out of four children suffered from this 
disease. Weinberg proposes another scheme, which involves 
a somewhat more complicated calculation, leading, however, to 
substantially the same result. 
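
The ratio just quoted can be checked with a few lines of arithmetic. The following is only a sketch: the per-family figures are those of the table above, the variable names are hypothetical, and the binomial formula used for the mean error is an assumption, since the text does not state which formula is intended.

```python
import math

# Children under observation and later cases of myoklonus in each family,
# as given in the table above (totals: 31 children, 8 cases).
observed = {
    "I": (5, 2), "II": (6, 0), "III": (4, 1), "IV": (5, 2),
    "V": (2, 0), "VI": (3, 1), "VII": (3, 1), "VIII": (3, 1),
}

children = sum(n for n, _ in observed.values())
cases = sum(c for _, c in observed.values())
rate = cases / children

# Assumption: binomial mean error of a relative frequency, sqrt(p * q / n).
mean_error = math.sqrt(rate * (1 - rate) / children)

print(f"{cases} cases among {children} children: {rate:.3f} +/- {mean_error:.3f}")
```

With so few children the mean error is large compared with the ratio itself, so "about one out of four" is as much precision as the figures will bear.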



XII. 

Sometimes the problem is complicated by the absence of 
certain essential data; it is then frequently impossible to draw 
a safe conclusion. The question of the "handicapping" of 
first born children having been discussed of late in an inter- 
esting article in this Journal* by Louis I. Dublin and Harry 
Langman, I shall choose that problem as an example. On 
account of its great importance I shall deal with it at some 
length, discussing some Danish statistics of persons suffering 
from tuberculosis. 



Number in    Total        Number Affected with    Per Cent.
Family.      Children.    Tuberculosis.           Affected.

 1            3,522              988                  28
 2            3,344              713                  21
 3            3,043              568                  18
 4            2,619              427                  16
 5            2,161              271                  13
 6            1,731              198                  11
 7            1,296              113                   9
 8              962               81                   8
 9              636               46                   7
10              445               48                  11
Over 10         841               69                   8

Total        20,600            3,522                  17







Readers of this Journal will easily understand how this 
table was constructed, by consulting the paper I have cited. 
Seven hundred and thirteen second born children, for instance,
were affected; their first born sisters and brothers will be
found among the 3,522 first born, this number also including 
988 first born affected with tuberculosis, 568 first born brothers 
or sisters of third born persons affected, etc. 
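
The percentages in the table can be recomputed as a check. This is a minimal sketch with hypothetical variable names, using only the two columns of counts given above; it also confirms that the 3,522 first born in the first column are exactly as many as the affected persons, one first born being counted for each.

```python
# Total children and number affected at each order of birth
# (1 to 10, then over 10), copied from the table above.
total_children = [3522, 3344, 3043, 2619, 2161, 1731, 1296, 962, 636, 445, 841]
affected = [988, 713, 568, 427, 271, 198, 113, 81, 46, 48, 69]
labels = [str(k) for k in range(1, 11)] + ["over 10"]

percent = [100 * a / t for a, t in zip(affected, total_children)]
for label, a, t, p in zip(labels, affected, total_children, percent):
    print(f"{label:>7}: {a:>5} of {t:>5} = {p:4.1f} per cent")

overall = 100 * sum(affected) / sum(total_children)
print(f"  total: {sum(affected)} of {sum(total_children)} = {overall:.1f} per cent")
```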

*"On the Handicapping of the First Born" in American Statistical
Association, Quarterly Publications, XIV (1915), 727-735.

There is a strong temptation to conclude that the first born,
with their 28 per cent. affected, have a much greater risk of
being attacked by tuberculosis than the second or third, not to
mention the later children. But it is evident that these
observations are one sided, and that essential details which
ought to have been included are omitted from the tabulation.
If we inquire what is the probability of children with no sisters
or brothers being affected, we find that it is one, all the children
observed in this group being in fact affected. The case
resembles very much the problem of constructing a mortality
table from observations of deaths only. The present statistics
fortunately enough can be separated according to the number
of the birth and the size of family, so that we can form
the following table:



Number of               Number of Children in Family of the Affected.
Affected Child
in Family.         1     2     3     4     5     6     7     8     9    10  Over 10   Total

1                178   155   170   128   107    74    52    45    26    19       34     988
2                      146   134   106   100    77    44    45    22    12       27     713
3                            120   109    85    68    56    51    32    19       28     568
4                                  115    63    81    60    33    31    18       26     427
5                                         75    54    35    38    19    23       27     271
6                                               81    43    29    11    12       22     198
7                                                     44    36    13     7       13     113
8                                                           49    12     5       15      81
9                                                                 25     6       15      46
10                                                                      28       20      48
Over 10                                                                          69      69

Total            178   301   424   458   430   435   334   326   191   149      296   3,522







It will be seen from this table that 178 affected belonging 
to families with only one child have to be left entirely out of 
consideration, there being no other group of observations com- 
parable to them. Consequently we have 810 first born com- 
parable to 713 second born. Likewise we can compare 568 
third born to 713—146 = 567 second born, etc. We can 
therefore arrange the observations as follows: 



Number in    Number       Number in    Number       Expected
Family.      Affected.    Family.      Affected.    Numbers.

    1           810           2           713         712.1
    2           567           3           568         561.6
    3           448           4           427         420.3
    4           312           5           271         305.8
    5           196           6           198         219.8
    6           117           7           113         147.3
    7            69           8            81          99.6
    8            32           9            46          58.8
    9            21          10            48          37.6
                 60       Over 10          69

Total         2,632                     2,534










The expected numbers have been calculated by distributing 
301 affected persons belonging to families with two children
uniformly on each number, further a third part of 424 persons
in families with three children on each of the three birth num- 
bers concerned, and so on, for instance 

$\tfrac{1}{2}\cdot 301 + \tfrac{1}{3}\cdot 424 + \tfrac{1}{4}\cdot 458 + \dots = 712.1$
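
The whole column of expected numbers follows from the same rule. The sketch below uses the numbers affected in families of each size, taken from the "Total" row of the table on the size of family. How the 296 persons from families of more than ten children were distributed is not stated in the text: the divisor 13 is an assumption chosen because it comes close to the printed figures, and the names are hypothetical.

```python
from fractions import Fraction

# Affected persons by size of family (2 to 10 children).
affected_by_size = {2: 301, 3: 424, 4: 458, 5: 430, 6: 435,
                    7: 334, 8: 326, 9: 191, 10: 149}
over_10 = 296
# Assumption (not in the text): families with more than ten children
# are treated as if they had 13 children each.
ASSUMED_SIZE_OVER_10 = 13


def expected_share(min_size):
    """Affected expected at one birth number in families of at least min_size."""
    total = Fraction(over_10, ASSUMED_SIZE_OVER_10)
    for size, n in affected_by_size.items():
        if size >= min_size:
            total += Fraction(n, size)
    return float(total)


expected_numbers = [expected_share(s) for s in range(2, 11)]
for s, e in zip(range(2, 11), expected_numbers):
    print(f"families of {s} or more children: {e:6.1f}")
```

Under this assumption each computed value lies within about 0.1 of the corresponding printed figure, from 712.1 down to 37.6.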

The two columns in the above table would be homogeneous, 
if there were no other defects (as we shall see directly that 
there are). On examination the numbers of affected are found 
to correspond with the expected numbers fairly closely; there 
is no considerable difference with the single exception of the 
first born, who indeed seem a little "handicapped" compared 
to the later children. But the numbers are not quite homo- 
geneous. To show this, let us suppose that the year of observa- 
tion is 1910 and that each affected person was 30 years old, 
so that all the persons affected with tuberculosis in the table 
were born about 1880. Now the marriages of the parents of 
these persons are from varying epochs. Probably the parents 
of the first born generally married in 1879, of the sixth born 
perhaps about 1869. Comparing now in the table on p. 48 
the numbers of the first and of the sixth child in families with 
at least six children, we find 250 against 198. But these 198 
belong to marriages which are, say, 10 years older than those 
to which the 250 first born belong. But since the population 
has been increasing steadily, we cannot compare the numbers 
till we have made an allowance for this increase. Thus the 
difference between the two numbers will be much reduced. It 
seems probable, however, that we should find some little 
difference in favor of the later born children; it must be remem- 
bered that the first born children generally have a compara- 
tively high infant mortality, so that proportionally few will 
survive, — a fact the correction for which tends to increase the 
difference. At all events the "handicapping" of the first born,
as far as tuberculosis is concerned, must be insignificant com- 
pared to the enormous differences indicated in the table (p. 47). 

This example shows how very difficult it is to handle one- 
sided observations. We are safe only when we have data not 
only on the numbers affected by tuberculosis but also on the 
whole population among which these cases of disease have been 
observed. What is wanted is a fraction, the numerator of
which gives the number of cases of disease, deaths, etc., and
the denominator, the total number exposed to risk. Thus 
only can we draw a safe conclusion. If these required elements 
are not available, we must find whether and how far the de- 
fects in the statistics will invalidate our conclusions, by taking 
the various defects into consideration, as described above 
(pp. 17, ff.). This process, as we have seen in the present 
case, is often a very difficult task. 

XIII. 

Returning now to the problem of dealing with observations 
of characters which are not quantitatively measurable, I shall 
borrow some figures from a paper by Professor Pearson in 
Biometrika (III, 1904). One thousand seventy six pairs of 
sisters in a school were characterized by the teacher according 
as they were "quick," "good natured" or "sullen." The 
figures were as follows: 

 99 pairs, both sisters quick.
498 pairs, both sisters good natured.
 60 pairs, both sisters sullen.
177 pairs, one sister quick, one good natured.
 77 pairs, one sister quick, one sullen.
165 pairs, one sister good natured, one sullen.

Here again it seems to me that the data require only an 
elementary treatment in order to give full evidence as to the 
influences concerned. We have among 2,152 children 452 
reported as quick, 1,338 as good natured and 362 as sullen (or 
210, 622, 168 per 1,000): consequently according to the cal- 
culus of probabilities 44 per 1,000 of the pairs should be both 
of them quick, 261 should have one quick and one good 
natured sister, etc. We can construct the following table:



                                        Expected    Actual
                                        Number.     Number.

Both sisters quick                          48          99
Both sisters good natured                  416         498
Both sisters sullen                         30          60
One sister quick, one good natured         281         177
One sister quick, one sullen                76          77
One sister good natured, one sullen        225         165

Total                                    1,076       1,076



We see at once that the sisters are very often alike, 657 pairs 
having the same temperament against 494 as expected. It is 
not difficult to find the mean error in order to judge how much 
of the difference between the expected and actual cases is due 
to other causes than chance variation. 
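
The elementary calculation can be retraced as follows. This is a sketch only: the counts are those listed above, the names are hypothetical, and the mean error at the end rests on an assumed binomial formula that treats the expected share of like pairs as known, which the text does not specify.

```python
import math

pairs = {
    ("quick", "quick"): 99,
    ("good natured", "good natured"): 498,
    ("sullen", "sullen"): 60,
    ("quick", "good natured"): 177,
    ("quick", "sullen"): 77,
    ("good natured", "sullen"): 165,
}
n_pairs = sum(pairs.values())

# Each pair contributes two children to the counts by temperament.
children = {}
for (a, b), n in pairs.items():
    children[a] = children.get(a, 0) + n
    children[b] = children.get(b, 0) + n
share = {k: v / (2 * n_pairs) for k, v in children.items()}

# Expected pairs if the temperaments of the two sisters were independent.
expected = {}
for (a, b) in pairs:
    p = share[a] * share[b]
    expected[(a, b)] = n_pairs * (p if a == b else 2 * p)

same_obs = sum(pairs[(c, c)] for c in children)
same_exp = sum(expected[(c, c)] for c in children)

# Assumption: binomial mean error of the number of like pairs.
p_like = same_exp / n_pairs
mean_error = math.sqrt(n_pairs * p_like * (1 - p_like))

print(f"like pairs: {same_obs} observed, {same_exp:.0f} expected")
print(f"difference {same_obs - same_exp:.0f}, mean error about {mean_error:.0f}")
```

Computed from unrounded shares, the expected numbers may differ by a unit from the table, which works with rates per 1,000.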

Professor Pearson prefers another way of dealing with this 
problem, arranging the experience in the following table:





                                  Second Sister.
First Sister.        Quick.    Good Natured.    Sullen.    Total.

Quick                  198          177            77        452
Good natured           177          996           165      1,338
Sullen                  77          165           120        362

Total                  452        1,338           362      2,152



Thus each pair appears twice: in 77 cases one sister is quick and
the other sullen, and these are tabulated as 77 quick-sullen and 77
sullen-quick, etc. Giving now each group a separate weight
we may calculate a coefficient of correlation. The result will 
be as above, but the calculations will require much more time 
than the more elementary procedure, and the results do not 
present any clearer view of the influences in question. We may 
simplify the table by reducing the number of columns, for 
instance, by adding the good natured and sullen, so that we 
get the following result:





                                  Second Sister.
First Sister.                Quick.    Good Natured    Total.
                                       or Sullen.

Quick                          198          254           452
Good natured or sullen         254        1,446         1,700

Total                          452        1,700         2,152





But even thus we shall have more calculations than in the 
above elementary solution of the problem. 
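
How the two tables arise from the six counts of pairs can be sketched as follows. The code only builds the double-entry table and its reduced form; it does not reproduce Professor Pearson's weighting or his coefficient, and all names in it are hypothetical.

```python
categories = ["quick", "good natured", "sullen"]
pairs = {
    ("quick", "quick"): 99,
    ("good natured", "good natured"): 498,
    ("sullen", "sullen"): 60,
    ("quick", "good natured"): 177,
    ("quick", "sullen"): 77,
    ("good natured", "sullen"): 165,
}

# Double entry: every pair is entered once with each sister taken as "first",
# so like pairs are doubled on the diagonal and unlike pairs fill two
# symmetric cells.
table = {a: {b: 0 for b in categories} for a in categories}
for (a, b), n in pairs.items():
    table[a][b] += n
    table[b][a] += n


def collapse(category):
    """Merge good natured and sullen into a single class."""
    return "quick" if category == "quick" else "other"


small = {r: {c: 0 for c in ("quick", "other")} for r in ("quick", "other")}
for a in categories:
    for b in categories:
        small[collapse(a)][collapse(b)] += table[a][b]

for a in categories:
    print(a, [table[a][b] for b in categories], sum(table[a].values()))
print(small)
```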

Sometimes even these contingency calculations increase
the distance between the original observations and the
calculated results to such an extent that we should decidedly
prefer the elementary calculations. Thus, in a recent review
in the Journal of the Royal Statistical Society (Jan. 1916) of 
The Material Culture and Social Institutions of the Simpler 
Peoples, by Hobhouse, Wheeler, and Ginsberg, it is objected 
that a coefficient of contingency ought to have been calculated 
instead of confining the description of the conditions of these 
people to the simple statement that 47 per cent. of the "lower
Hunters," 25 per cent. of the "higher Hunters," and 10 per
cent. of the lower "pastoral" group fall into the category,
"government slight or nil." It seems to me that this com- 
parison is quite sufficient and that it would be waste of time 
to proceed to further calculations. 

If I am right in these remarks, statistics of this kind can gen- 
erally be treated in quite an elementary way, and no new 
results can be found by using more complicated methods. The 
principal problem will always be to find the relative frequency 
of the event or the character which we want to investigate, 
and first of all our object must be not so much to prepare 
refined statistical methods as to provide useful observations. 
The special nature of the observations may then lead to mod- 
ifications of the formulas, to a more complete and refined 
theory, but at present we are more in need of statistical data 
than of theoretical investigations. This may sound curious, 
in view of the immense mass of details which are published 
every day by the numerous statistical institutions all over the 
world, but to a great extent all these reports are repetitions, so 
to speak, of older investigations, most of them made in a 
single mold. We want statistical observations covering new 
fields, and here, as shown above, an enormous amount of work 
remains to be done. Of course we cannot do without these 
myriads of statistical volumes; they have, at least the great 
bulk of them, their local claim to exist; — but beyond these 
reports numerous problems are waiting to be solved and the 
solution will require much patience and much careful work in 
gathering the necessary materials. As far as I can see, the 
future of statistics will depend on the energy with which these 
problems are taken up. 



